CBMM10 Panel: Neuroscience to AI and back again
Date Posted:
November 6, 2023
Date Recorded:
October 7, 2023
CBMM Speaker(s):
James DiCarlo,
Ila Fiete,
Nancy Kanwisher,
Christof Koch,
Gabriel Kreiman
Speaker(s):
Talia Konkle
Description:
Review of progress and success stories in understanding visual processing and perception in primates and replicating it in machines.
Key open questions. Synergies.
Panel Chair: J. DiCarlo
Panelists: I. Fiete, N. Kanwisher, C. Koch, T. Konkle, G. Kreiman
TOMASO POGGIO: If you look at evolution, evolution went from probably very simple associative reflexes to language, to logic, eventually to large language models. And in a sense, these are back to association, so from programming to learning and associating things, so I guess neuroscience to AI and back again. That's it. Jim?
JAMES DICARLO: OK. All right. Thanks, Tommy. Can you guys hear me OK? OK, I have this amazing panel that we're going to sit with today. I'll introduce them in a minute, and I'll let them all say what they want to say. But I just wanted to whet the appetite of what I thought this session would mostly be about. I did that a little bit yesterday, but let me try again.
I showed you guys these slides yesterday, and I often take them as facts, but I know when I present them, they're controversial facts. So that's, in some sense, what we'll discuss. So I kind of showed you this idea that a lot of the successful interplay in the last 10 years has been especially in sensory systems. And I've been fortunate to be involved in this visual part of this. This is the ventral visual stream in a series of cortical areas that are thought to be involved in tasks like object recognition and other cognitive functions.
All of which I would wrap under the umbrella of visual intelligence, broadly construed. And I made the argument that these have both inspired, informed, and directed the development of certain types of deep architectures. This is a simple feedforward deep convolutional network. Of course, models have advanced since then, but this just fixes the idea conceptually.
And these models, I argued, were also some of the-- became over the last decade, starting about 10 years ago, the leading models in aspects of computer vision, and then were generalized to other aspects of "AI," I said, by just general deep learning. So going from deep architectures to deep learning. I mostly focused on the deep architecture, especially the convolutional part and how that relates to sensory processing.
And there's been this successful interplay, again: an informing of these models as something we should pay attention to, as conceptual ideas that then get built by engineers, who optimize the parameters, and that then come back to actually start to explain the things we were observing up here that we couldn't easily explain before. So I think of that as a virtuous loop: inspiration flows this way, engineering tools optimize, it comes back, and you seem to explain a lot of things that were hard to explain.
And I argue that that loop continues today. And this actually motivates a big part of the overall quest, the brains, minds and machines process that I described: you inform, you can build, and then you come together around models that serve both as scientific hypotheses and as technology drivers. And this is reiterating what I just said, that these are technology drivers, deep architectures for deep learning. And in this audience, in the scientific mode, these are actually the leading hypotheses, I said, of at least the first 200 milliseconds after an image comes on.
And that was, in some sense, where these have been most studied, at least in the visual system. And then I went on to argue about the limitations: well, this is just the first 200 milliseconds. Should we really think about deep architectures? What does that all tell us? This is an area where I think CBMM has made big inroads in being a leader in the field, especially in transforming the natural sciences, especially primate and human vision. Again, this has changed the way people do business around visual processing, especially central visual processing, and it has had impact on the technology side.
The panel we have today is really to try to unpack this for you a bit, maybe both historically-- what have we really learned-- and then looking forward, where might this be pointing us? And this continues today. This loop doesn't run just once; it keeps going. So I'm trying to motivate in this panel a set of questions, and I'm going to outline them for you here, and then we'll zoom in a bit and give the panelists a chance to say something.
So one stepping back is like, well, Jim puts this nice slide up. What, if anything, does this actually tell us about visual processing and visual intelligence? Does it tell us anything? Does it only tell us about adult mechanisms? Can we even use the word mechanism when we describe these models as describing data that we're observing in the system? So we'll discuss that.
What does this not tell us about the processing and visual intelligence? Often, the learning mechanisms of these models are not thought to be anything like the biological learning mechanisms. And I think that will probably be part of this, as just one example. I think our panelists will speak to this.
And I think even perhaps most interestingly, stepping away from vision and sensory systems, what lessons can and should be generalized to how we do science in this area of the interface between artificial built systems, often optimized by things that are themselves difficult to understand, and natural systems, which are complicated objects that we would love to understand in some meaningful way? And that's, of course, an active discussion in the field.
And this is just one area which is then informing, how do we go forward to other domains of intelligence? And I think our panel will touch on that, as well. So those are the three big picture questions that I hope to motivate when I come back to the questions. The panelists are the ones that are going to do all the work, and I'll try to keep things exciting and interesting. They will each introduce themselves.
We have a mixture of people that do-- I think everybody here touches computation. Everybody does something that I would call computational neuroscience. We have people that are more on the experimental side and people that are more on the model-building side. And you'll hear that from each of them. We also span things from rodents all the way up to humans in this group, so rodents, primates, and humans. Again, you'll hear that in their introductions.
These are the questions I'm going to ask them at the end. I asked them ahead of time, and I'm going to show you their answers. And they don't all agree, so that'll motivate some discussion. I'll just whet your appetite here. Some but not all deep ANN processing systems are also the current best models of the mechanisms by which primates and humans process sensory input-- yes or no? Let's discuss that. That gets to the question of mechanism and what we mean by that.
Second, the most important discoveries from neuroscience and cognitive science have already been incorporated in these models. Or do other things need to be incorporated? And if so, what kind of things need to be incorporated? And if so, what would that change in terms of our understanding? Of course, that's speculative, but I'm asking people to predict, if they want to say that, what they're proposing there.
And then about models in general-- this is over the next decade-- will visual and sensory neuroscience migrate away from deep ANN models toward some other type of models? And if so, what kind of models are we imagining? I'd like people to speculate on that. And of course, as I mentioned, this should go beyond sensory systems. This is just a starting point to ground the discussion.
And I'll come back to those questions if they went by too quickly. I'll put them up, and then maybe you can guess what people would say for each of those. But I'll show you where their starting positions are and let them discuss that as a panel. I'm going to now give each of the panelists five minutes to say something within the general frame I just laid out. I don't know exactly what they're each going to say, but that's OK. We'll start with Ila Fiete here from MIT. So Ila, all yours.
ILA FIETE: So what I wanted to just do was to whet your appetite on one direction or one argument that can be made in the panel, and that is that one of the things that we're really missing is how structure and function emerges rather than is trained in. So what do I mean by that? So here's, again, Jim's slides. So on the one hand, we've got the visual stream, visual processing hierarchies.
The circuits are modular. They're also hierarchically organized. And on the other hand, we've got these circuits that have partially been inspired by neuroscience, but are these machine learning circuits. Now, a couple of comments. First, this modular, hierarchical organization is not just true for the visual system; it's a general motif in sensory systems. And over here, we're building a lot of this in. We have to empirically try to build these things in when we build machines.
Here we can study single species one by one and study the organization. And the question is, what can we say about what happens in biology, and what is left to incorporate from biology into machine learning? So one thing to note is that this modular organization in biology is ubiquitous. There are also cognitive areas-- grid cells. This is development. This is how the fly gets its beginnings in the embryo, before these colored bands turn into segments in the body. And even in cortical development, there's a laminar organization of cortex that starts to happen very early on.
So here's a slide that was a bit of a meme from Yann LeCun. In the earlier days-- of course, we're back full circle now-- he used to argue that reinforcement learning, which was a lot of the learning being done at the time, was the cherry on the cake, supervised learning was the icing on the cake, and most of the cake was actually unsupervised learning. So that was the argument.
But of course, as a biologist, I would say, well, where is the cake mix coming from? And so the biologist would argue the cake mix is actually genetics. So we've got our genes that eventually set up the circuits that are the cake. But on the other hand, I would argue now, as a physicist who studies biological systems and dynamics in the brain, that this is not the whole story, because anybody who's ever eaten cake and used cake mix knows there's a huge difference between the powdery cake mix and the actual cake. So what's the gap between the cake mix and the cake?
OK, so this is the computer scientist's view. This is the geneticist's view. And then here's the view of someone like me, which is that there is this whole process of emergence where you go from the ingredients, which are the cake mix. And then you have to put in water and heat and time, and all this rich stuff happens. And then you get this fluffy, soft, tasty thing. That's cake. And so this is the idea of emergence. So in other words, we can think about evolution as this bottleneck, this funnel, that shapes your genes.
But then the genes provide only a very tiny amount of information for the structure, and then all of the process of emergence is this whole red part here, which then gives rise to the brain. So we're studying questions about how very, very simple growth processes based on spontaneous activity can lead to the emergence of structure in the brain. And you need very little.
So spontaneous activity alone, maybe in the embryonic animal before it has any visual stimuli, together with some competitive growth rules, can lead to the organization of modular structures that are connected hierarchically. And moreover, they have all kinds of topographic organization and patterning, as well as feedforward and recurrent connectivity. In fact, the structures that emerge have all these hallmarks of beautiful convolutional architectures, using the same growth rules and tuning only three or four parameters.
And you can just tweak those parameters and go from a primate visual system to something that looks more like a mouse visual system. And then you can apply the same rules and go from something that looks like a visual processing hierarchy to an auditory processing hierarchy. So I guess my argument would be that development-- this emergence process, self-organization over development-- is the place where we might find a lot of fruitful ways to bridge those two areas. Thanks.
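(A toy illustration of the self-organization idea just described, not the actual model: competitive Hebbian learning driven purely by random spontaneous activity organizes a smooth topographic map from a handful of parameters. All values here are illustrative.)

```python
# Toy sketch: units on a 1-D lattice compete for random "spontaneous
# activity" patterns; the winner and its lattice neighbors move toward each
# input, so nearby units come to prefer nearby inputs (a topographic map).
import numpy as np

rng = np.random.default_rng(0)
n_units, dim = 100, 2
weights = rng.random((n_units, dim))       # random initial "wiring"
sigma, lr, n_steps = 5.0, 0.1, 5000        # the few free parameters

for _ in range(n_steps):
    x = rng.random(dim)                    # spontaneous activity, no stimuli
    winner = np.argmin(((weights - x) ** 2).sum(axis=1))
    dist = np.abs(np.arange(n_units) - winner)
    neighborhood = np.exp(-dist ** 2 / (2 * sigma ** 2))
    weights += lr * neighborhood[:, None] * (x - weights)
# After training, neighboring units code for neighboring inputs: structure
# that emerged from noise plus competition, with no visual input at all.
```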
JAMES DICARLO: Wow, thank you. OK.
[APPLAUSE]
JAMES DICARLO: Next up we have Nancy Kanwisher, also from MIT. Nancy?
NANCY KANWISHER: It is surprising and amusing to me to be standing here talking on this panel. About five years ago, I would never have thought I would land here. I tried for the longest time to keep my head down and wait for all this stuff to blow over, but I guess it's not blowing over. And actually, as I look up, it's pretty damn interesting. Great, thank you. That's perfect.
So what I want to do is just briefly sketch some of the reasons why I think this whole enterprise is pretty interesting. So first, as Jim has already noted, the most obvious import of artificial neural network models of the ventral visual pathway is that, for the first time, we have really plausible, image-computable models that kind of work-- models of how object recognition might work in the brain.
And as background, Jim and I taught this graduate seminar about 15 years ago for, actually, many years on object recognition. And at the time, the state of the models of object recognition were kind of like this. Well, first maybe you find some edges, and then maybe somehow you put those edges together and you find parts of objects. And then somehow you represent the relationship of parts to each other, and that's kind of like a description of object shape. And you compare that to something in memory, and that's object recognition.
That was kind of the state of the theories at the time. And when Jim and I taught this course, I figured since that's all we knew how to say, that we would focus on these side issues, like OK, how is it modulated by attention? And can you activate the system with mental imagery? And Jim kept saying, no! No, no, no. Let's focus on the actual problem of object recognition.
I'm like, how? And he said, well, there's these things called GPUs. And I didn't know what the hell he was talking about. Anyway, that's just as background that it's actually, I think, really big stuff and important that we now have actual computational theories of what might be going on in the brain.
And as Jim has already alluded to, and as I'm sure you guys all know, the evidence so far from lots of other labs is that, actually, there's an astonishing degree of fit between many of these artificial neural networks trained on, say, ImageNet and activations you see in the brain. And I think that's super cool and kind of profound, in a way. It did not have to be that these systems optimized through totally different methods, evolution and learning versus backprop, would end up at even remotely similar solutions.
So I think that's super interesting. It's also really useful. So Ratan Murty, who was working with me and Jim a bunch of years ago, said oh, yeah, we can do this on whole regions of the brain. We can build a CNN-based model of the fusiform face area. I'm like, you can? How? Why? What would that tell you? And then he did it. And he basically showed using the usual methods that you can take some linear weighted combination of units in a CNN and fit that to the mean response of the whole fusiform face area.
And when you test that model's response on held-out images-- I know this is really tiny, but this is just showing you a damned high correlation, 0.91, between the predicted response of the FFA to a new image and the observed response. I'm like, whoa. That's pretty cool. So now we have something useful. We have a proxy model of the face area response. So the first thing we did was run it over a weekend on millions of images to see if we could find any images that it predicted a strong response to that weren't faces.
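(A minimal sketch of this kind of proxy-model fit, assuming a pretrained CNN; the layer choice, the ridge regression, and the placeholder arrays images and ffa_response are illustrative, not the exact procedure of the study described.)

```python
# Fit a linear readout from CNN unit activations to a region's mean response,
# then score predictivity on held-out images. Data here are placeholders.
import numpy as np
import torch
import torchvision.models as models
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

cnn = models.alexnet(weights="DEFAULT").eval()

def cnn_features(images):
    # images: float tensor (n, 3, 224, 224), already normalized
    with torch.no_grad():
        feats = cnn.features(images)            # late convolutional stage
    return feats.flatten(start_dim=1).numpy()   # (n_images, n_units)

images = torch.randn(200, 3, 224, 224)          # placeholder stimuli
ffa_response = np.random.randn(200)             # placeholder measurements

X = cnn_features(images)
X_tr, X_te, y_tr, y_te = train_test_split(X, ffa_response, test_size=0.25)
readout = Ridge(alpha=1e3).fit(X_tr, y_tr)      # regularized linear weights
r, _ = pearsonr(readout.predict(X_te), y_te)    # held-out correlation
```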
That was cool. There's lots of uses of these models as proxy models. But I think even more interesting to me is that I think we can use these models to ask why questions about brains-- why questions that I never thought we'd be able to address just a few years ago. So for example, Katharina Dobs started in my lab a few years ago to ask, why is it a good way to design a brain with the modular organization we see, with separate patches of brain for recognizing faces, and places, and bodies, and words, and other stuff?
Why should brains be designed like that? And so she set out to test this question in a number of ways. But the cool finding was that when she trained a standard CNN-- this is supervised training-- on a few hundred object categories and a few hundred face categories, what she found was that the CNN did great at recognizing novel faces and novel objects when you later test it.
But then when you look inside and ask, how did it do that, you see that it has spontaneously segregated itself into two separate systems that increased their segregation as you go up the stages of the network with units at the later layers that were causally involved almost entirely only in face recognition, not object recognition, and the other way around, just like we see in the brain-- a double dissociation in the network.
And I think that's really fascinating because it says, maybe the reason we have these separate systems in the brain is that these are just fundamentally different computations or different feature sets. And so it makes sense to segregate them in any computational system that's going to have to solve all those problems. And I think that's super exciting.
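(A sketch of the lesioning logic behind such a double-dissociation test; model_fn, the data tuples, and the unit indexing are hypothetical placeholders, not the actual experimental code.)

```python
# Silence late-layer units one at a time and measure how much face accuracy
# vs. object accuracy drops; a double dissociation means some units matter
# only for faces and others only for objects.
import numpy as np

def lesioned_accuracy(model_fn, images, labels, lesion_idx):
    # model_fn (placeholder) runs the network with the listed units zeroed
    # and returns class scores of shape (n_images, n_classes).
    scores = model_fn(images, zero_units=lesion_idx)
    return np.mean(np.argmax(scores, axis=1) == labels)

def double_dissociation(model_fn, face_data, obj_data, n_units):
    base_face = lesioned_accuracy(model_fn, *face_data, lesion_idx=[])
    base_obj = lesioned_accuracy(model_fn, *obj_data, lesion_idx=[])
    face_drop = np.empty(n_units)
    obj_drop = np.empty(n_units)
    for u in range(n_units):
        face_drop[u] = base_face - lesioned_accuracy(model_fn, *face_data, lesion_idx=[u])
        obj_drop[u] = base_obj - lesioned_accuracy(model_fn, *obj_data, lesion_idx=[u])
    return face_drop, obj_drop   # compare the two profiles across units
```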
And in more recent work, my lab, and also Talia, is doing very similar stuff using self-supervised networks and finding generally similar patterns of segregation in networks and in brains. And that's super interesting. Also relevant to this spontaneous segregation is the discussion in yesterday's panel about embodiment-- well, yeah, you need to have a motor system, and you need to do stuff, and it matters for evolution, and it matters for development.
But let's not go overboard. These networks don't really have any embodiment at all. They're just fed a bunch of images, particularly the self-supervised ones. They have no knowledge of the meaning of those images. They have no post-perceptual use of that information. They have no body they're moving around the world. Yet they discover a lot of the same structure. It doesn't mean that's how it arises in brains, but it means that in principle, some of that organization can arise without all the broader brain and body that the system is-- oops. Oh, wow. OK.
We find similar spontaneous emergence of behavioral signatures of the face system in networks. Now, how seriously should we take these models as models of the brain? I don't know. I've worried about this a lot. I worry that taking some linear weighted combination of units from some huge network and fitting some brain response is too easy. There's a whole lot of units there. That's too easy. And to what extent does that mean we should say, that network mirrors what we see in the brain? I don't know.
But Meenakshi Khosla here has been working on a much higher bar for similarity of networks and brains, and it's a return to kind of classical views of neuroscience, where we used to believe: OK, there's Hubel and Wiesel. There's an orientation detector. It has a tuning. That's what that neuron means. It's tuned to that orientation. That's been thrown away with modern methods, where people use linear decoding methods, regression methods, RSA.
All of those things are blind to the actual tuning of individual units. And what Meenakshi has been showing is that actually, brains are aligned 1 to 1 at the neural level from one person to another. Networks trained on natural images are also aligned 1 to 1. And brains are aligned 1 to 1 to networks. And this gives us, I think, a much more literal mapping between networks and brains. And it also gives us a stronger test for alignment of networks and brains.
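(One concrete way to operationalize such a one-to-one test, not necessarily the exact metric of the work described: pair each brain unit with a single model unit by optimal matching on response correlations, rather than allowing arbitrary linear mixtures.)

```python
# Score one-to-one alignment: correlate every brain unit with every model
# unit over stimuli, then find the best single-unit pairing.
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_alignment(brain, model):
    # brain: (n_stimuli, n_brain_units); model: (n_stimuli, n_model_units)
    b = (brain - brain.mean(0)) / brain.std(0)
    m = (model - model.mean(0)) / model.std(0)
    corr = b.T @ m / brain.shape[0]             # unit-by-unit correlations
    rows, cols = linear_sum_assignment(-corr)   # maximize total correlation
    return corr[rows, cols].mean()              # mean matched correlation

# Placeholder data; a high score means single model units mirror single
# brain units, a stricter criterion than regression-based predictivity.
score = one_to_one_alignment(np.random.randn(500, 80), np.random.randn(500, 120))
```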
For example, with the standard methods, transformer models fit the ventral pathway as well as CNNs do, but not with 1 to 1 alignment. They're not as good, so this gives us a stronger test of the different models. And the final thing I'll say, especially since Josh is here-- Josh Tenenbaum-- lest I have too much CNN triumphalism-- we can't have that.
I think actually, one of the most important things we can learn from CNNs is what they can't do, and there's a lot more to vision than just pattern recognition. And so it will be an interesting empirical question going forward to figure out which of those things these models can do and which they can't.
JAMES DICARLO: OK, great. Thanks. Applause for Nancy.
[APPLAUSE]
JAMES DICARLO: All right, let's see. OK. Christof, I think you're up. So Christof Koch from the Allen Institute and Tiny Blue Dot.
CHRISTOF KOCH: Thank you. Thank you, Jim. Thank you very much. I'm here because many, many years ago I was the first graduate student of Tommy's. And then I did my postdoc here in the last century. And I've been connected ever since. I'm here to make three points about biology and about humans. One is the existence of cell types, which ANNs and CNNs and modern transformers do not at all respect.
So we now know this in the mouse. We find something very similar in non-human primates. And next week we have a special issue coming out in Science making similar points in human brains. So we all know the different types of cells, right? But there are a lot of different types. The current count is 5,200 cell types, organized in a taxonomy using five different levels. This is using MERFISH, together with Xiaowei Zhuang at Harvard, and RNA-seq covering 7 million neurons.
So it's a very, very deep taxonomy. You can access all of that and all the data online. For any one area, there are roughly 100 different cell types-- this in particular is M1. And once again, we find something very similar-- not identical, of course, but very similar-- in human and non-human primates. So this is not particular to the mouse. In fact, you can really align the transcriptional hierarchy between the different species.
And it's remarkable how much of the taxonomy is conserved. So here, on the left, we show this just for M1, using different techniques. There are 39 excitatory cell types, 42 GABAergic cell types, and then 14 other non-neuronal cell types. On the right, you can see this very nice spatial organization, going from layer 1 all the way to the bottom of layer 6 here, down here, layer 6b, aligned with the transcriptional taxonomy underneath.
Here we show in color only the intratelencephalic pyramidal cells-- the excitatory cells here. The colored ones are only the ones that project to other cortical areas, ipsilateral or contralateral. And in human-- by the way, this is expanded. There are many more supragranular excitatory cell types in human layers 2 and 3 compared to the mouse.
Now, all of these differ by the genes they express. So typically, here, we have 8,000 differentially expressed genes. So we have these very high-dimensional spaces in which we find these clusters. And the types differ by their electrophysiology, and particularly by the way they project-- the zip codes they carry. Where do they send the information? And so the belief in the field right now is that to understand anything in biology, particularly in normal functioning and in disease, it's critical to know what type of cell you're recording from.
Nothing in the retina makes sense without knowing this-- retinal physiologists don't say, well, I recorded from 200 cells in the retina. You want to know what type of cell, what type of retinal ganglion cell, and where it projects to, because that's absolutely essential to understand the function of that particular type of cell in the retina.
That's no different in cortex. If we want to understand this cortical tissue, or any other tissue, we need to understand which type of cell we are recording from, what type of information is processed, and where it is being sent. I mean, the idea here is that these two different cell types, because they're different, each have their own axon. They send their information to different places.
That's what it looks like if you do the three-dimensional reconstruction that you can now do in the mouse-- because otherwise you would have a single axon that goes to both places simultaneously. So that's point number one. It's really critical to understand cell types. Point number two is-- I decided not to show a slice.
So you all know we built these large-scale brain observatories, where you can now routinely record, using Neuropixels, from 100,000 cells, typically, or 200,000 cells, and where we can do very large-scale surveys and try to model the actual responses of those cells against models. And what we found in the mouse is that the number of cells that do either simple- or complex-type computation is vanishingly small.
You can find some cells that do it very well. But typically that's 1% or 2% or 3% of the cells that you can model with an r squared, let's say, bigger than 0.5. The vast majority of cells don't follow that. And it makes sense if you look at the receptive fields: they don't have simple or complex types of receptive fields. And of course, it also makes sense if you look at the vast connectivity.
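(A sketch of the kind of census described here: compute a per-cell r squared of model predictions on held-out responses and count the fraction above threshold. The responses and predictions below are random placeholders.)

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

responses = np.random.randn(1000, 200)    # held-out responses, (cells, trials)
predictions = 0.1 * responses + np.random.randn(1000, 200)  # model output

r2 = np.array([r_squared(y, p) for y, p in zip(responses, predictions)])
frac_well_fit = np.mean(r2 > 0.5)         # the "1% to 3%" figure is this number
```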
On the connectivity point in particular: the majority of synapses, as we all know-- and most models today ignore this-- come from sideways connections or from feedback connections. And then lastly, a point I always make-- I feel like Cato the Elder-- that ultimately, if we want to say we understand the visual system, or the auditory or any other sensory system, we have to understand its key property in humans, and presumably in all animals, which is that we not only behave, but we see. We actually see.
We have conscious visual experiences, or otherwise conscious sensory experiences. And that's a physical-- I mean, that's a property I think about. It's a physical property of certain types of systems, and ultimately, as scientists, we need to understand how those specific feelings arise from activity in this particular type of cortical tissue. Thank you.
JAMES DICARLO: Thank you.
[APPLAUSE]
JAMES DICARLO: Thank you, Christof. OK, let's see. Next up we have Talia Konkle from Harvard University. Talia, all yours.
TALIA KONKLE: I actually am-- I like where I'm positioned in the argument, because Christof just laid out a bunch of details of biology that make you go, how are we going to connect these to CNNs? Is there a huge gap? How do we fill that? And so I think there's a way we can think about models and what they're doing that can help us navigate that conundrum.
So there's one way of thinking about models that I think Jim highlighted, which is to think of them as models of the visual system in a way that has a tight correspondence. V1 matches layer 1. V2 matches layer 2. And the aim on this formulation is to add more biological detail to make it a better and better correspondence, making sure it matches with the data and phenomena, with the goal of driving to the single best model that's the most brain-like model.
But another way of thinking about these models is as model organisms. Some systems neuroscientists study mice, non-human primates, treeshrews, even the fly, even the Drosophila. And the premise here is that there's something that mice and flies and monkeys share that makes them informative to understand the human biological system or biological systems in general, in some ways, but not all ways.
And we're not confused by that. We're OK with that logical inference. And in this case, the onus is on the experimenter to motivate which bits you should pay attention to and which bits you don't need to and why, to make an inference about what bit of computation is giving rise to some capacity-- some representational signature, some behavioral capacity.
And on this framing, the goal is actually to understand the model in and of itself and to understand which mechanisms have which kinds of consequences. So there are researchers who spend their time just trying to understand the fly, not even necessarily even worrying how the fly relates to the human yet, because the promise is there will be deep underlying principles about how information is transformed and passed on and converted between sensory and motor in a loop with respect to goals, and that that's going to translate over from across systems.
And I think the same exact thing is true for models. So I am a systems neuroscientist of AlexNet. I love AlexNet models. I like ResNets and VGGs and transformers, too. I study them, too. But I really like AlexNet. I've spent lots of time dissecting its circuit and visualizing it. And the idea is, I think it's actually valuable to understand that model in and of itself. How is it working? How is it taking the image content and converting it through its hierarchical stages into a format that can tell apart cars and shoes and chairs and faces and people?
And even understanding that, regardless of whether it's human-like or not, is valuable. It gives us possible answers. And then the challenge is on us to go, I think these are the critical mechanisms that are giving rise to that. And I think those are mimicked in the biological system in these ways. And that's the work to do.
And so in this way, actually, we have the mice-- you know, normally the biological scientists are limited by their tools, right? We have to do all our engineering to figure out where we can measure, or put the person in the scanner. The mouse people boast, we have all the optogenetic tools. Well, the systems neuroscientist of a machine learning model has the best tools of them all. We have all the tools.
There are engineering problems. You have to figure out how to get the model in RAM and all the things. There are still plenty of engineering problems to solve. But we can do everything. We can measure everything. We can redirect circuits. We can inject patterns. We can lesion-- we have reversible lesions. We can do all of it. And so it's paradise.
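(A sketch of that point, assuming PyTorch and an off-the-shelf AlexNet: a forward hook records any internal unit and applies a lesion, and removing the hook restores the intact model, making the lesion fully reversible. The layer and channel choices are illustrative.)

```python
import torch
import torchvision.models as models

model = models.alexnet(weights="DEFAULT").eval()
recorded = {}

def record_and_lesion(module, inputs, output):
    recorded["acts"] = output.detach().clone()   # measure everything
    output[:, :10] = 0.0                         # silence the first 10 channels
    return output

handle = model.features[8].register_forward_hook(record_and_lesion)
with torch.no_grad():
    lesioned_scores = model(torch.randn(1, 3, 224, 224))
handle.remove()                                  # lesion is fully reversible
```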
So to the systems neuroscientists of model cognition, we're in a playground, and we've only just started. And I think to Nancy's point, she's saying there's these 1 to 1 correspondences between some of the tuning we're seeing in these models and the brains. And I think that's not because these are models of the system in the first way.
I think that's because there are deep ways that information needs to get routed and that all the systems are doing. And that's actually a very deep and fundamental link because of how information and the visual structure is formatted and how it needs to be represented. And this framing actually is nice because it makes a wide bridge to the AI and ML communities.
The corresponding name for this area is the science of machine learning. There are vested AI interests in explainable AI and in understanding units, in doing spectral analysis of the population code to tease apart sparse and distributed populations and superpositions. And all of those insights can now be directly ported into the systems neuroscience onus: to figure out which of those are relevant for the biological system and why. And I think this is a big bridge that we can cross.
I am out of time, so I just want to end with one thing. This lets us do mechanistic controlled-rearing experiments. And it lets us get to the deep questions: fundamentally, these are powerful learning systems-- that's what makes them special-- which means that we can now have all this new traction on understanding what kinds of things need to be built in, given what kinds of experience, to give rise to what kinds of representations. And that's the whole game. And that's where we get to play. Thank you.
[APPLAUSE]
JAMES DICARLO: OK. OK, the final panelist to speak to you is Gabriel Kreiman from Harvard Medical School and Children's Hospital. So Gabriel.
GABRIEL KREIMAN: Thank you very much. I'm very excited to be here and to have shared these 10 years of CBMM. I want to thank especially Tommy and Jim and all the colleagues at CBMM who have made these 10 years so special. So I want to start with a couple of very basic points that are, I hope, not very controversial at all, and then progressively move to some more controversial things.
So first of all, here on your left is the famous painting saying, this is not a pipe. This is a representation of a pipe. In a similar vein, I think it's pretty clear and everybody would agree: this is not a brain. This is a model, a representation. And I would contend that understanding brain function-- especially the visual system, but actually any aspect of brain function-- means building theories.
We're not going to understand brain function by sheer data collection and sitting on a chair and figuring out how things work. We need theories, and we need to instantiate those theories into models. All models are wrong. Some are very useful. And in that sense, I'm excited about neural networks and how inspiring they have been, and how profoundly they have changed our field in the possibility of asking questions.
So I want to just mention quickly this idea from David Marr and Tommy Poggio, the three levels of analysis: we need to understand systems at the behavioral and computational level, at the level of algorithms, and then ultimately at the level of implementation, in neurons and hardware. And just to give you a sense of the trajectory of CBMM in the last decade-- this is a paper from Tommy Poggio from 1999, introducing HMAX, one of these neural network architectures for studying visual processing.
Back at the time-- I'm actually not sure, maybe Pietro can tell us-- in computer vision, there were maybe a couple hundred people doing neural networks. And in neuroscience, I would say probably around 10 people in the world were doing neural networks. So the trajectory over the last decade has been amazing. We have now democratized the system so that everybody can play with these neural networks as a foundational way of thinking, playing, and generating hypotheses, which I think is something quite amazing that has happened recently.
I also want to give a big shout-out to the idea of sharing code and sharing data, which has enabled this kind of work. Now, there are many, many issues with current neural networks. I think they are quite amazing, and I particularly like the idea that Jim put forward of the virtuous loop between generating hypotheses, testing them, validating models, and building better ones. I think that's the way to go.
These are some of the main issues. People in general have established that current models are quite fragile. It's pretty easy to deceive them in many ways. They fail to generalize in major ways. And despite exciting and enormous progress, I would argue that there's a major lack of alignment with neural, as well as behavioral, data. So what's missing? Well, plenty of things are missing. And I think this goes back to what Christof was mentioning earlier.
So here on the left you have a famous diagram by Felleman and Van Essen, which describes the mesoscopic connectivity of the primate visual cortex. It's pretty clear that our best models are still a far cry from what's actually happening in cortex. And in addition to cell types, which was already mentioned by Christof, we now have access to the beginnings of what will be eventually full connectomes of systems. Quite far from that still in cortex, but we have this ability to interrogate the detailed connectivity of the circuitry.
So I think in the future, Christof said nothing will make sense without thinking about cell types. Nothing will make sense without looking at the connectome, as well. So I think we'll have to incorporate these ideas. Do we need every single detail of biology incorporated into computational models? No. Part of the idea of building theories and models is to actually be able to have abstraction. And abstraction is critical.
So here are some of the things, perhaps more or less controversial, that I think are going to guide us over the next decade. Think about learning-- Jim already mentioned that, and Tommy has mentioned many times the very strong differences between learning in biological and artificial neural networks. There's the basically complete lack of dynamics and temporal integration, and of thinking about context, both in time as well as in space.
There are the fundamental horizontal and top-down connections, which are abundant-- in fact, much more abundant than the purely feedforward ones. And then another point that people have made quite often is the major difference in energy efficiency, which will probably play a very important role in developing actual hardware and actual systems that work in the real world. So that's all I wanted to say. And I look forward to all the discussions in the panel.
[APPLAUSE]
JAMES DICARLO: Thank you. Thank you, Gabriel, again. Right on time. Perfect. OK. So I'm now going to ask the panelists to come up and sit at the table, and we're going to try to motivate some discussion. These were the questions I had seeded a bit that I thought might be controversial. We'll see if they like their starting positions. So the first question was that some, but not all, deep ANNs-- so the idea is not ANNs in general, but some specific ANNs, derived, you could say, from current AI or computer vision, however you want to say those models are emerging, whether we're talking about AlexNet or ResNet-- are also the current best models of the mechanisms-- notice I'm saying these words carefully-- by which primates, perhaps humans, process sensory input. So I think Ila says that's not true. Christof says it's not true. And Gabriel also says that's not true. But Talia and Nancy both say it is true. So there's some controversy here. I'm like the McLaughlin Group. Has anybody ever seen the McLaughlin Group? Old show. It's like, I'm going to be McLaughlin. I have my answer in reserve. You guys may already know what my answer is. And I'll tell you the right answer at the end, but let's let you have this discussion. So go ahead and discuss. Anybody want to chime in on this one first to start?
NANCY KANWISHER: Well, tell us better models if you want to know.
JAMES DICARLO: Well, let me--
NANCY KANWISHER: Nobody's saying they're perfect.
JAMES DICARLO: Let's save that for later. That was question three. We'll come back to the better models. But this is just--
NANCY KANWISHER: It's relevant.
JAMES DICARLO: There are existing models. And there's a claim that they are mechanistic models of primate ventral stream processing. And I don't know, Nancy-- Nancy, you're defending "true." Maybe you want-- maybe let's talk with someone who says it's false. So who wants to go? Somebody voted false. So Christof, why are they not mechanistic models of visual processing?
CHRISTOF KOCH: No, they are.
JAMES DICARLO: Oh, you switched.
CHRISTOF KOCH: No, no. Yes. By some standard, they are the best models. I think they miss big parts-- they tell us nothing about cell types. And I compare it a little bit to a model of the retina. If you just take the DoG model of retinal processing, where you just filter the image through a difference of Gaussians, you can say, well, that's a reasonable first-order model of the retina. But of course, it misses a lot of the different cell types. So in that sense, it's the best current model there is.
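(The retina analogy in code: a difference-of-Gaussians filter, a first-order model of center-surround processing that ignores cell-type diversity entirely. Sigmas are illustrative.)

```python
import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(64, 64)                 # placeholder image
center = gaussian_filter(image, sigma=1.0)     # narrow center
surround = gaussian_filter(image, sigma=3.0)   # broad surround
dog_response = center - surround               # center-surround (DoG) output
```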
JAMES DICARLO: OK, so it's the best current model, but it's missing something, to summarize. Nancy wants to jump in, I think. Go ahead, Nancy.
NANCY KANWISHER: So Christof, you point out that one of the key differences, key imports of cell types is that they have different patterns of connectivity. Absolutely. But so if we could set up ANNs with all the possible patterns of connectivity, including the lateral things that are not really incorporated now and all of that, could you in principle discover cell types in ANNs, or do you think you need some other aspect of the hardware beyond connectivity?
CHRISTOF KOCH: You probably need to put in more specialization.
NANCY KANWISHER: Like what?
CHRISTOF KOCH: Well, that the units aren't all the same-- that they perform different computations. And most importantly, they send the information to different parts. This is all about the modularity that you talked about. It's not one homogeneous layer; there are all these different parts. That's why there are 371 top-level structures in the Allen Brain Atlas-- because it's not just one big, gigantic, homogeneous processor. It's very different from those guys.
NANCY KANWISHER: Right. But in principle, we can discover those connections.
CHRISTOF KOCH: Oh, yes. Of course we can.
NANCY KANWISHER: With models. And then the question is, what else do we need in terms of cell type specificity beyond differential connectivity? You say different--
CHRISTOF KOCH: That's probably the main one. That's probably the main one. I mean, the metabolic constraints, the developmental constraints, the evolutionary constraints-- we know they're all in there, like a palimpsest. It's all written in there, but that's probably less relevant to just having a mechanistic understanding of the adult visual system.
JAMES DICARLO: OK. So let's let others chime in. Let me see if I can summarize the arguments so far. Christof would say we can optimize models with a certain architecture, but if we optimized them with a more precise architecture, aligned with the biology, we would get true understanding. That's sort of a summary of your argument, I think. Whatever true understanding is, is still left to be determined. But that's, I think, what you're saying. OK. Talia, do you want--
TALIA KONKLE: Yeah. I was just going to draw from what Gabriel said, which is that models are about the level of understanding. And if your phenomenon of interest is cell types, a DNN is terrible. It doesn't have cell types, or you have to invent that. They don't immediately make contact with your phenomena of interest or your data. So they're terrible models.
I actually can understand the position entirely. If you study high-level vision, they're everything, because we had nothing that could make contact-- or just the early stages of it. And so they filled a gap that we had. But if that's not your phenomenon of interest, they're more or less appropriate. And so I see the point, which is, well, if you want to understand cell types, this may not be the best model yet. That's fair. But if you want to-- yeah. So I think this is about the zone and--
CHRISTOF KOCH: What about if you want to understand the system in pathology? What about if you realize that some pathology are very specific to specific types of cells?
TALIA KONKLE: Yeah. I think that's also the strongest argument for when things go wrong. Are they going wrong at the mechanisms that we're abstracting over to make this model work or not? That's an important question. Or as Nancy said, maybe we can just build those in.
JAMES DICARLO: OK. Ila or Gabriel, do you want to chime in on this question?
GABRIEL KREIMAN: I want to throw out a wild conjecture: if Isaac Newton had had neural networks, he would have played for many, many years at trying to fit the trajectories of soccer balls spinning, and so on. And he would have had great success, because he was a very smart person-- he would have had great success in predicting the trajectories. And he would perhaps never have discovered the universal law of gravitation. Feel free to discuss whether this is correct or not.
But just the idea that we can fit data doesn't mean that we have a mechanistic model. At the same time, I want to emphasize that the brain is a neural network. So ultimately, any kind of computation, any kind of theory, any kind of model needs to be linked back to some flavor of neural network, because the brain is a neural network. So ultimately, we actually need to link back to this level of explanation. So I'm not opposed to neural networks. I think they're here to stay. I think that they will not disappear. I just think that we need much, much more than showing that we can take data and fit it with overparametrized models.
JAMES DICARLO: OK, I see Nancy, but let's let Ila go first, because she hasn't had a chance.
ILA FIETE: No, I completely concur with Gabriel. And to be provocative here, I could list the ways in which I think these models are useful, but also important ways in which maybe they're not, or at least are leading us down the wrong path in the current iteration. So one is that even if we do what Christof is suggesting and Nancy is agreeing with, which is to build in more cell types and more detail about the connectivity and so on, we might get more and more detailed matches between the biology and the circuit model.
But that doesn't mean we understand-- to Gabriel's point, and I completely agree with it-- because a detailed model is not a principle. It's like Borges said: if you give me a map of the world that's as big as the world, that's not really a map. So you really do need some principles. You need some abstraction and some understanding. So that's one way in which they're failing us.
And we've certainly seen that it's very challenging to extract principles from these deep neural networks. They end up being a very detailed map of some very complicated processing thing. And then it's just another area of science to study how they work. And it's not clear to me that we have a way to extract something really abstract, a sense of real understanding, from those circuits, even if they really emulate the biology.
The other point is that they're very far from emulating biology. So I would argue that at the current moment, these deep networks are good models for cognitive scientists, as we can see-- our cognitive science colleagues are the most excited about them. And I would say they're terrible models for neuroscientists. And that's because I think they are the first image-computable models that go from inputs to behavior, which is what cognitive scientists have hoped for.
And in that sense, they're doing an OK job. But in terms of neuroscience, they're not providing any really rich predictions about circuit dynamics and perturbations. I have yet to see a perturbation-- well, except for some very interesting work from Jim. Yeah.
TALIA KONKLE: It's coming.
JAMES DICARLO: I'm biting my tongue here.
ILA FIETE: But de novo perturbation predictions-- about removing a type of connection and seeing what results-- are missing. And I guess the last thing I'd say is that some basis for skepticism is merited from a theoretical perspective, because if you take two completely different computing systems and have them both successfully solve the same task, they may have similar representations because they're each solving the same task, and there are some fundamental variables that must be represented to be able to solve those tasks.
So in so far as they both represent those fundamental variables, there will be a similarity. And the question is, how much similarity is there beyond that? Because only the similarity beyond that would we want to say that this is a specific model of the biological system doing that computation as opposed to two completely parallel systems that are doing the same computation.
And so I guess I haven't really seen strong metrics for that. And I think there's a lot of scope. At this point, it's almost like asking, are ANNs good models of anything? It's kind of like asking, are fractions good ways to do anything? It's like, yeah, we all use fractions in everything that we do. So at this point they're a tool, but I would say that the state of the field right now leaves much to be desired.
JAMES DICARLO: OK. But I said specific models, so you kind of generalized it, and I was careful not to say that.
ILA FIETE: Fair enough. Fair enough. I'm being provocative.
JAMES DICARLO: So the statement was not general fractions are useful. The statement was, there are specific models that are mechanistic approximations. So that's-- that's not an entirely fair argument to the question, but I will accept the statement. Sorry, I'm going to be a little more 'Dan-like' if anybody's seen the McLaughlin group. I'm going to tell them-- Nancy, did you want to add in before I try to summarize this conversation and maybe--
NANCY KANWISHER: I was just going to briefly respond to Gabriel's point that if Isaac Newton had had fancy CNNs, would he have discovered physics? Well, maybe what he would have discovered is that the CNNs actually couldn't predict what was going to happen next, and he would have to build in a lot of structure. I'm looking at Josh and Liz. Maybe that's what we would have discovered, and that would be super interesting.
JAMES DICARLO: OK. I want to react to the point that Ila made, that these are great models for cognitive science but not for neuroscience. When I talk to my cognitive science colleagues, I think it's the opposite: these are limited models for many cognitive functions, but they have mappability to neurons, so they're good for neuroscience. So you know you're doing the right thing when nobody's entirely happy with the models. And so I think that's where they live.
I think another fair summary of this is that you'll notice there are different goals implicit in the discussion. So Christof is channeling one, which I agree with as an MD: if you could connect to the cell types, that opens up control tools, or maybe repair tools, for certain disorders that the models don't yet connect to. But that may be a different goal than the theorist's goal of, if I could abstract away some principles of how these models work-- which you heard Ila and Gabriel, especially, channeling.
But both of them agree that you need models in the middle to be the glue for that. So at some level, I think the crowd is smarter than I am. They're converging on where I think the reality of the field is. I hope this was helpful for all of you in the audience. Before we move on-- and these questions are all going to connect to each other-- I see there may be some questions about this one. Let's let the audience chime in here briefly, and then we'll go on to the next question.
And you guys are supposed to use the mic, I was told. So if there is a mic, could it please come around-- or speak loudly, I don't know. Go ahead, please. Yeah. I think we've got Shimon, Kobe. Let's go.
AUDIENCE: So first of all, thank you for the awesome questions. Manolis Kellis-- I'm a professor in computer science across the street, but I also work in neuroscience, for some reason. So I want to go back a little bit to the cell types. How do we think about cell types? And how do we think about cell state? In some ways, the way that we're thinking about neurons right now in the AI field is as automata, in a way. Every single neuron has the same types of input, the same type of output. And I think that's one of the things that you're concerned about.
The other question is, how much state should there be? How complex should the automaton be? How complex should the computation be? And in a sense, every cell has, like, 20,000 genes. It could do whatever it wants with them. And we could also think about gene expression as not necessarily cell type, but maybe the state of the neuron-- that they all start with the same automaton and they reach different states. Maybe those states are guided by the inputs.
Maybe those states are guiding the type of computation that is going on. So I think that there's so much to be learned. I don't have any answers, but I think that there's so much to be learned by studying the types of outputs that we could get if we could enrich the diversity of neurons in these ANNs, if we could get inspired by the types of connectivity to perhaps refine the state, refine the types of computation that is going on.
And if we could start from the neuroscience and then say, OK, what are the types of computations that are actually being carried out, to then guide the models. And the other comment I want to make is that all of this is the 0.2 seconds of vision. And then there's all of the learning. And in many ways, I think of all of these abstractions as effectively learning a representation. And then the fun part begins with thinking.
And maybe-- we haven't even begun on that part, of what actually happens in the frontal cortex after all of these multisensory inputs are coming in. Maybe that's closer to LNNs. Maybe that's closer to LLMs. Or maybe that's closer to attention. Maybe that's closer to transformers. Maybe that's closer to GNNs. We don't know. But basically, I want to point out that maybe all of this is just the representation learning part, and that's the easy part. But now that you have the representations, you can do all kinds of cool stuff with it. Anyway, I'd love to hear your thoughts.
JAMES DICARLO: OK. You guys can talk. First of all, I wouldn't say it's easy, because it took a while to get there. But people are thinking about this, and the next panel is going to actually engage with those questions about what comes beyond. So I think we don't think 200 milliseconds is all of intelligence. And there has been work in that area.
It's just not the focus of the discussion right here, exactly. We were focused on the interplay at this scale-- just to set some context for the rest of the audience, and for my fellow colleagues, who I know would probably be screaming right now if they were up here. So I'm trying to channel them. Do you guys want to respond to that? Or is there anything burning anybody else wants to say? OK, go ahead, Shimon, and then we'll take one more from Kobe, and then I want to move to the next question.
AUDIENCE: So briefly-- usually you're careful to say that you're talking about the first 200 milliseconds. It's not here on the slide, and I just wanted to emphasize that this is true for the 200 milliseconds. Once you go beyond that, the top-down comes in, and the responses of the cells change in a way that these models cannot explain. And they cannot explain the learning, if you also involve learning of some new images, some new tasks, and so on.
Even if you look at early stages, the responses will probably be different, because the kind of backpropagation we use in models and the kind of learning used in the brain are not in great alignment. So it's true for limited things-- no learning, 200 milliseconds. There, I think we have good models. But even for the ventral stream, once you go beyond these limitations, the models are not great.
And maybe there are models that do not perform that well, but that already include some lateral connections and feedback connections. They do not reproduce these results as well as the current models, but they have more to say about how the ventral stream works under more general conditions. So this is not to belittle what's going on-- there are great discoveries. But these limitations apply even when talking about models of the ventral stream.
JAMES DICARLO: Great. OK. Does anybody want to comment on that? Maybe Talia--
CHRISTOF KOCH: One burning comment. I totally agree, because even within those 200 milliseconds we already have, of course, massive feedback-- from layer 6 into the thalamus, from layers 2 and 3 back into layers 5 and 6. So it's not just 200 milliseconds of feedforward, and only then does the feedback come.
JAMES DICARLO: Talia, did you want to say something?
TALIA KONKLE: Yeah. I would say I agree with all those things, but I would add: yet. We just got these models. People are building long-range feedback connections that have attentional modulation, that go with the state, that amplify but don't hallucinate. People are building recurrence. People are thinking-- yet. There's all this visual cognition and visual neuroscience research, and we have all these phenomena.
We've been tinkering. We have a hypothesis about how those feedback connections need to be structured-- for instance, feature-based attention operates over the whole visual field. OK, I can think about how I would need to wire it up so that that would happen. Yet. We're just starting. These things are happening. So I think it's a mistake to see the differences as proof that the models are just a poor approximation of the biology and be discouraged.
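A minimal sketch of the mechanism described here: feature-based attention as a channel-wise gain applied uniformly across the whole visual field of a CNN feature map, written in PyTorch. The module is an illustrative assumption, not a model from the panel:

```python
import torch
import torch.nn as nn

class FeatureBasedAttention(nn.Module):
    """Scales each feature channel by a top-down gain, uniformly across
    the whole spatial field -- a common model of feature-based attention,
    as opposed to spatially localized attention."""
    def __init__(self, num_channels):
        super().__init__()
        # One top-down knob per feature channel; 1.0 = no modulation.
        self.gain = nn.Parameter(torch.ones(num_channels))

    def forward(self, feature_map):
        # feature_map: (batch, channels, height, width). The same gain
        # multiplies every spatial location, so attending to a feature
        # amplifies it everywhere in the visual field.
        return feature_map * self.gain.view(1, -1, 1, 1)

attend = FeatureBasedAttention(num_channels=256)
modulated = attend(torch.randn(1, 256, 14, 14))  # placeholder features
```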
CHRISTOF KOCH: Oh, you shouldn't be discouraged. You should just acknowledge those differences because they drive new research.
TALIA KONKLE: Yeah, that's right. But that acknowledgment is often tethered to the conclusion that these models are not getting anywhere or going anywhere. And I think, yeah-- glass half full.
JAMES DICARLO: Kobe, last question on this one, and then we're going to move to the next question. Yeah.
AUDIENCE: A very general question. It is we humans, the possessors of brains, who are sitting here outside the brain and trying to draw parallels between the brain and artificial intelligence systems of different sorts. When do you think the annual meeting will come at which two AI systems sit at this table and discuss what they think of the parallels between them and us?
JAMES DICARLO: I don't know if anybody wants to jump in.
TALIA KONKLE: It's after the 200 milliseconds.
JAMES DICARLO: It's well beyond my--
NANCY KANWISHER: Some other panelists--
AUDIENCE: So, whoever of you is the closest to an AI machine.
ILA FIETE: Within five years. Five years.
TALIA KONKLE: That's right. Take your estimate and halve it.
JAMES DICARLO: Will you let ChatGPT be a member of the panel in five years?
CHRISTOF KOCH: Exactly.
TALIA KONKLE: For sure. We could do it today. We could ask what it says.
ILA FIETE: That's a good idea.
JAMES DICARLO: It's a good suggestion. We should prompt it with the question and see what it says-- if anybody wants to do it, give us the answer while we're offline. I'm moving on-- thank you, Kobe. That's really provocative and motivating, and we'll see about ChatGPT sitting here soon. Let's go to question two. Again, I was trying to keep us focused on concrete stuff here. So again, for our panelists, let's see what I wrote.
The most important discoveries from neuroscience have already been incorporated in these models. We've already heard examples that speak to this question, so you've essentially covered it a bit. We've talked about things that relate to timing and feedback; we've talked about cell types. What are other things that are not yet incorporated?
So everybody said false. Nancy, I didn't catch what you said. So everybody thinks something still has to be built into the models, and you've each said a little bit about what. If people want to amplify this as an extension of question one-- what are your intuitions about what's missing? If I could channel people: Ila, you're saying we don't have the right learning mechanisms, as a general theme, and the emergence from biology. That was her intro.
Christof was saying we don't have the cell types and the connectivity, and somehow we need to put that in. Let's see, who else am I going to challenge? Talia would say, yeah, we don't have the timing right, and that's probably feedback connections, but let's not throw out the baby with the bathwater-- let's see if we can put that in. And your answer is exactly what I would have said, so great, you got the right answer. I'm just kidding.
And then Nancy and Gabriel, I don't know if you'd like to add-- or if anybody else would say what's missing that you want people to think about, that maybe they're not thinking about.
CHRISTOF KOCH: I think causal-- we'll really be surprised by causal manipulations in the human brain. Less so the mouse brain, but the human brain. Think of using fancy technology, you know, like Neuralink, et cetera, where we can directly stimulate and inactivate parts of the brain, or things like psychedelics, which are very powerful modulators of visual experience, used as tools. So I think that will drive novel discoveries.
JAMES DICARLO: So experimental-- you're making suggestions for interventional experiments.
CHRISTOF KOCH: Interventional experiments, particularly in humans, particularly in vision. You know, obviously, the difference between us and rodents.
JAMES DICARLO: OK. So stronger perturbation tests of models, in the spirit of what Ila was saying-- is that the kind of thing you want to do?
CHRISTOF KOCH: How does your model respond to psilocybin?
JAMES DICARLO: What is that?
CHRISTOF KOCH: Psilocybin.
JAMES DICARLO: But that would require models that have inputs for such a thing, which would require cell type modeling.
CHRISTOF KOCH: But if you're putting in top-down inputs, there you go.
ILA FIETE: We need those knobs. We need--
TALIA KONKLE: We need the knobs.
ILA FIETE: Yeah. We need to connect the parameters in the model with tunable things from the outside world.
CHRISTOF KOCH: This is where cell types, of course, becomes critical because serotonin goes to specific cell types.
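A minimal sketch of the kind of "knob" under discussion: a neuromodulator-like gain applied only to specific cell types in a toy rate network. The cell-type labels, receptor-expression values, and drug level are hypothetical illustrations, not measured quantities:

```python
import numpy as np

# Toy rate network with labeled cell types. A neuromodulator "knob"
# (e.g., a serotonin-like drug level) scales the gain of only the
# cell types that express the corresponding receptor.
rng = np.random.default_rng(0)
n = 100
cell_type = rng.choice(["pyramidal", "pv", "sst", "vip"], size=n)
W = rng.normal(0, 1 / np.sqrt(n), size=(n, n))  # recurrent weights
receptor_expression = {"pyramidal": 1.0, "pv": 0.0, "sst": 0.5, "vip": 1.5}

def step(rate, drug_level=0.0):
    # Per-neuron gain depends on cell type and on the drug knob.
    gain = np.array([1.0 + drug_level * receptor_expression[t]
                     for t in cell_type])
    return np.tanh(gain * (W @ rate))

r = rng.random(n)
for _ in range(50):
    r = step(r, drug_level=0.8)  # turn the knob, watch the dynamics change
```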
JAMES DICARLO: OK, good. Nancy or anybody else-- you were going to say something. And Gabriel, I was going to try to also give you--
NANCY KANWISHER: I'm just going to say not what I think the models are currently missing, but just to re-emphasize the point that I think the current tests of models are impoverished in the way that I mentioned, and more literal direct comparison of model units to brain units will take us farther in assessing them.
JAMES DICARLO: OK. So stronger tests might reveal what we're missing, but you're not making a statement of, here we're obviously missing X, Y, or Z. Nobody is. OK. Gabriel, anything you want to add?
GABRIEL KREIMAN: I agree that the tests are still very poor. We already mentioned learning. We talked about the intrinsic connectivity of the circuitry, which is way more complex than what we have in neural networks.
JAMES DICARLO: OK. So if I could summarize-- there's a general optimism. We've now got this exciting thing, these objects that we can study, as Talia said. But we know they're maybe more the beginning than the end, and we need to do stronger experiments and stronger comparisons that will lead us to add some of these things, we hope, in the spirit of good science.
CHRISTOF KOCH: Well, we're also totally missing consciousness.
JAMES DICARLO: OK.
CHRISTOF KOCH: All I get is embarrassed, nervous laughter-- yet we do consciously see things.
JAMES DICARLO: OK. So that might be on the next slide. Let's see if that connects, and maybe we will return to this. So this is my third and last motivated question; afterward, we can go back to audience questions. So: over the next decade, the field will migrate away from ANN models toward other types of models. Here I was talking about sensory processing, but maybe we want to talk about models more broadly.
And I think, again, we've got some controversy here. Christof is saying we will move right away from these, and so is Ila, but everybody else is saying, no, no, we're going to stay with these kinds of models. So there's some controversy. We'll hold the consciousness question until toward the end, I hope, but let's make sure we return to it.
We could connect it to visual perception, which is what these models are also supposed to be trying to achieve. So if you want to speak to that as part of what a new model should do, connect it to visual perception-- maybe not consciousness entirely yet, if we could keep it a little grounded.
NANCY KANWISHER: Yeah. So I just want to say, since I'm down here saying false, what I mean by that is that I think the current models may suffice for a chunk of the first 200 milliseconds, or maybe 150 milliseconds, of basic pattern classification. But that leaves open the whole universe of more interesting stuff we do in vision-- seeing the relationships between objects, their physical relationships, their social relationships, predicting what will happen next, all of that. And I think it's a wide-open empirical question, and probably we'll need very different kinds of models to accommodate all that.
JAMES DICARLO: So you're switching from false. You're no longer saying we won't migrate away from ANN models-- you're saying we are going to migrate away.
NANCY KANWISHER: Well, I think it depends on what we're trying to model. I'm saying it's possible that for barebones pattern classification, this will still be the best thing. But that is a tiny piece of vision. OK, yeah.
JAMES DICARLO: Anybody else want to say, are we going to be using ANNs in 10 years, like are we going to-- or not?
ILA FIETE: Because it partly depends on what that means, right?
JAMES DICARLO: OK, part of a fair answer.
ILA FIETE: If it's just a--
JAMES DICARLO: Go ahead, Ila.
ILA FIETE: Yeah. It depends on what you mean by ANNs. If it's an infinitely flexible definition, then of course we'll still be using them. If it means any possible model that requires-- I guess I don't know what it means, exactly. Is it gradient learning? Does it mean individual cells without rich temporal dynamics? It depends, I think, is the answer. Of course we will-- again, back to the fractions analogy: we will always continue to use fractions.
But I guess I wouldn't say that my model is a fractions-based model just because I compute fractions in it. So in the same way, I think that--
CHRISTOF KOCH: I use decimals.
ILA FIETE: Yeah, you use decimals. OK, you're decimal-based, oh. OK. You're about a century ahead of me. But I guess, yeah, depending on what we mean by that I think that the answer is either true or false.
TALIA KONKLE: I think DNNs are an expressive enough language that they are going to be what we use, because even if they don't have cell types, we can say: I know what I would do to make them have cell types. Even the people who say it doesn't have attention-- a cognitive scientist can say, well, I know what mechanisms or algorithms I would add to give it attention. So you can come at the same language, the same model, from cell types or from top-down feature-based attention, and have an idea of what you'd need to change to fix it.
So I think that puts it at a sweet spot of expressivity that connects quite a lot of disciplines. And we're going to be using them for a long time. I'm sticking to my answer.
JAMES DICARLO: OK. Anybody else want to--
GABRIEL KREIMAN: So I was ambivalent about how to answer this, but I want to go back to what I said before: I think the brain is a neural network. So ultimately we'll have many other, much better, more interesting models. But ultimately we will want to map those onto circuits of neurons and how they connect, which is a neural network. So we may have spikes, we may have dynamics, we may have cell types-- we may add all kinds of bells and whistles.
But ultimately, any model that we postulate-- and we will postulate much, much better ones-- ultimately they will have to be mapped onto some sort of neural network.
ILA FIETE: So maybe we're proposing that we're post-ANN in the sense that we won't be talking about whether it's an ANN model or not, right? It's just a model that uses ANNs, in the same way that I use fractions.
JAMES DICARLO: So it might be-- just to channel back-- not today's ANNs but some other ANNs, because we think the brain is a neural network. But let me push on this a little, because then we end up with a giant ANN model-- the map of the world-- to push back on. It's still an ANN model.
Maybe it's not your grandfather's ANN; it's the next-generation ANN, but it's still a big, complicated computer model at some level of detail, maybe down to the cell types. Again, scope this to the visual system if you want-- we could do it for the whole brain and our thoughts, but even just for the visual system, we could imagine that. And Ila, you said, well, that would be like having a map of the world that's the size of the world, so it doesn't count as an understanding.
I just want to add a comment there: if that map of the world is a lot smaller and runs a lot faster and so forth, it might still be a useful object, in the same way Nancy said in the introduction. But I think many of you would say you want something more than just having a model that's an ANN. So maybe-- I may be shifting the conversation, and maybe we agree that, depending on the language, this was sort of a trick question.
But let's come back to what this notion of understanding might look like, even with the next generation of ANNs, because I think that question runs deep in our field-- how we think about what success would even mean. And your map-of-the-world example prods that notion, even if it is your next-generation ANN. So maybe you each want to jump on that, and then we'll go to the audience to see if they have thoughts.
CHRISTOF KOCH: So we are trying to build this model of the-- or we have built it, for primary visual cortex. We know more about mouse V1 than any other cortical area, including now the connectome. And so it's an incredibly detailed model-- it has the exact synapses at the exact locations. Yes, it's computable, but it's very tedious to manipulate, because anything you do is constrained by all the other constraints. So in that sense, I think these simpler ones are better models that would generalize.
So you could essentially generalize that, and people will do the entire mouse. Again, that might be useful to compute things like local field potentials and EEG, but it's not useful for thinking about computation. It's simply too cumbersome, and the dimensionality of these models is millions of unconstrained parameters.
JAMES DICARLO: But imagine-- so you could have a visual system simulator, and it could run really fast. We might not call it an understanding, but it might be a very useful object for humans to possess. Is that a fair summary?
CHRISTOF KOCH: If you're an engineer or a neural engineer and you want to predict-- I put this drug on, and I want to predict ahead of time how the brain will respond-- it'll be perfect for that.
JAMES DICARLO: OK. And so the question is, maybe that might arise before these other types of understanding-- and would that be good or bad? But sorry, Ila, I cut you off.
ILA FIETE: I was just going to say, it's like the idea of an emulator or a digital twin, right? And in that sense it's very useful, because that map of the world that's almost as big as the world could still be useful: you can't do the experiments on humans that you would like to do. So I think it would be useful, but it would only be a piece of it. We would still seek understanding; we would still seek the abstractions and the simple models.
And there's another really interesting thing, actually, that has come up in my group: is it even possible-- is it a quixotic quest-- to try to build that emulator, that full-scale model of the brain? And in the future, is that what we're going to want? Or are we going to want to do it more like a control system? That's the other version of building a system: you have a model and you want to use it to achieve certain things, which is, given this disease, I want to figure out what I should control to bring the symptoms down.
And the question is, is the best model for that going to be-- and I don't know the answer at all-- a full detailed emulator? Or is it going to be a control model, where you just have a few knobs that you actually can control, and then you figure out how to apply that control? And that's a really different perspective. And I don't know at all. I'd love to hear what you guys--
CHRISTOF KOCH: Talia, yes.
ILA FIETE: Probably yes. You mean instead of a full-scale emulator?
CHRISTOF KOCH: No, both.
ILA FIETE: You would want to control-- oh, both. Yeah. Yeah.
CHRISTOF KOCH: Depending on what you want to do.
ILA FIETE: Yeah.
TALIA KONKLE: Yeah. That goes back to models as a model of the system versus a model organism. So an emulator is: let's keep adding things, because we want it to look just like the real thing and have all the pieces. But another way of asking the question-- instead of looking to biology for what we should add-- is: what capacity or phenomenon do we care about, and what are the mechanisms we need to bring that about?
So if you care about top-down attention, you're probably going to need feedback, but you might not need something else. We believe that there are joints-- that there are meaningful links between sets of mechanisms and capacities. And so then we're going to have that model. We'll build all these models, and then we'll have a proliferation of models.
Some are really good. If you want to study one thing, you use these ones-- they have the right feedback connectivity. If you don't really care about that, you use those ones, because they have the cell types that get at a different kind of phenomenon. And so I think what we might end up with is a really amazing emulator that we don't understand, but that is really useful for some purposes.
And then a lot of these-- maybe you can call them control systems-- separate models where we understand how they work and they're suited to certain capacities but not others. And that's a way to get an angle on what matters for what, so that it's not just one big soup. Yes, I see nods. That made sense. Good.
JAMES DICARLO: Anybody want to add on that? OK.
NANCY KANWISHER: I think that's exactly right. I think the kind of model you want to build depends on the thing you're trying to understand. And so different phenomena will be important for understanding different things. And so it's actually a strength that you have this proliferation of models of different parts, because each of them can tell you, these are the aspects of the system that are important to produce those phenomena. So totally agree.
JAMES DICARLO: So maybe the dream of the field-- the one I often presented when I was a department head-- is that we want to go from molecules to mind, which would lead to a giant emulator as the model of the mind. Maybe instead the dream of the field is a connected set of models that together allow us to achieve the dream of the build, not one big monolithic thing that goes from molecules to mind. That's a summary of maybe one version of what you're saying.
TALIA KONKLE: I think you want both. I think there will be some questions that really do require the monolithic model of mind. So I think only picking one or the other would be a wrong bet. Let's do both.
JAMES DICARLO: OK. Let's turn it over to audience questions. I prodded you guys enough with mine. I could prod some more, but let's give people a chance in the last 15 minutes, especially people who haven't gotten to go. Yes.
AUDIENCE: So I would be curious to hear from the panel, what do they consider to be the top one or two insights that neuroscientists have gotten in the last 10 years by using DNNs as models? And so I want to exclude a couple of things. One is, I'm not talking about using DNNs to analyze your data, but as models.
JAMES DICARLO: Insights from neuroscience.
AUDIENCE: Insights within neuroscience. So let's do a thought experiment: we take away DNNs from the last 10 years of the history of neuroscience-- what would we not know today? What would be a key insight that we wouldn't have gotten? And I want to exclude another thing, which is: well, we don't understand well what DNNs do, and we don't understand the brain, so we feel more comfortable not understanding neuroscience. I would like to exclude that one. So let's hear a positive insight. I would love to hear from the panelists what they consider to be the top one or two.
JAMES DICARLO: Ila looks like she's ready to jump in.
ILA FIETE: Piedro, I love that question, and I actually would have been very happy to pose it myself. And I sincerely don't know the answer. I would say that in my own field-- I'm not speaking for visual neuroscience, but my own field, where we talk about navigation circuits and grid cells and hippocampus-- there has been a lot of activity building models using ANNs.
I would say that those models have not yet contributed any new predictions or insights. That would be my strong assertion, and I would love for somebody to disagree and give me examples. In grid cells-- I'm not talking about the visual system; I very, very clearly want to say that.
JAMES DICARLO: I didn't quite get-- the question is, what has neuroscience contributed to the world? Is that a summary?
NANCY KANWISHER: No.
[INTERPOSING VOICES]
JAMES DICARLO: OK. What has AI contributed to neuroscience--
ILA FIETE: No, no, no, no.
AUDIENCE: We can go back to Gabriel's statement by [INAUDIBLE], which is: all models are wrong, some are useful. So if they have to be useful, they have to be useful for something. And I would say that, as a neuroscientist, I want to understand how the brain works. So by looking at, learning about, and comparing these models with the brain-- what is it that I am now able to do that I wouldn't have been able to do had I not had access to these models in the last 10 years?
CHRISTOF KOCH: So for me, it's at least an existence proof that simple models with simple receptive fields, wired up in a very simple way, can do, to me, very remarkable things that seem close to some of the things the brain is doing early on. And we didn't have that 10 years ago.
AUDIENCE: So it's good for more than predicting?
CHRISTOF KOCH: Well, no. It's an existence proof-- no, it's not-- it's an existence proof that one type of computational architecture, one that maps in some way onto the brain, seems to be along the right lines--
AUDIENCE: We can stay the course.
AUDIENCE: The comment on HMAX made that point.
CHRISTOF KOCH: Yeah, but now in a generative construct, where I can show an image and see the output.
AUDIENCE: But tell me something positive that we wouldn't have, had we not had access to these models. You're telling me that you sleep better at night, but you're not telling me what you can [INAUDIBLE].
JAMES DICARLO: I certainly have answers to this, but I'm going to let the panelists-- Nancy and Talia have their hands up, and then Gabriel. Sorry, you guys were first. Nancy, did you--
NANCY KANWISHER: One thing that I don't know we totally know, but we're starting to get a glimmer of, is that some of the structure of the visual system can emerge without some of the particular domain-specific constraints one might have thought necessary. It doesn't mean that that's necessarily how it emerges in brains, but it feels more possible to me than it did a decade ago, in the sense that you just use self-supervised training of some arbitrary CNN on a bunch of natural images.
And it has a lot of similarities to brains, including subsystems that are selective for faces and bodies and places and so on. And nobody told the network that those were important categories-- they were not fed in as categories. So to me, that's really bracing. It says a lot of this stuff might, in principle, just arise from experience with minimalist priors.
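A minimal sketch of the kind of setup described here, assuming a SimCLR-style contrastive objective in PyTorch; the backbone, loss, and the commented training step are illustrative stand-ins, not the specific models from the studies being discussed:

```python
import torch
import torch.nn.functional as F
import torchvision

# Self-supervised (contrastive) training: no category labels anywhere.
backbone = torchvision.models.resnet18(num_classes=128)  # embedding head

def simclr_loss(z1, z2, temperature=0.1):
    # z1, z2: embeddings of two augmented views of the same images.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature   # pairwise similarities
    targets = torch.arange(len(z1))    # matching views are the positives
    return F.cross_entropy(logits, targets)

# One hypothetical training step over a batch of natural images:
#   views1, views2 = augment(batch), augment(batch)
#   loss = simclr_loss(backbone(views1), backbone(views2))
# Face/body/place-selective subsystems are probed post hoc;
# they are never built in as labeled categories.
```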
JAMES DICARLO: Talia.
TALIA KONKLE: Yeah, just following up on that, I think this is where there have been a lot of breakthroughs in high-level vision around learnability arguments: how much the structure of natural image statistics, plus a self-supervised objective-- just learning to discriminate all the things you see in your world, in this hierarchical way-- gives rise to all kinds of things, including face-selective clusters, body-selective clusters, the major divisions that we see.
And it gives us access to how those were computed. And that has other consequences, like a deeper understanding of the tuning. So I'm going to pick on face cells, because that's where we've made the most inroads-- it's the front edge of the high-level vision sword. When people go to study faces, they often look at variation within faces, at how faces vary.
But one of the deep insights that's come out recently is that to understand the tuning of a face cell, you actually need to understand how faces fit within the context of all the kinds of things you see, because the axes of face cells are not only about within-face variation. They're about separating faces from all the other things. So all the models that only look at face variation to study face cells have missed quite a part of the response, and a deeper understanding of how it gets there and why. We would never have gotten that without self-supervised visual learning.
CHRISTOF KOCH: It has also influenced neuroscience-- people are doing different experiments.
TALIA KONKLE: Yeah. I mean, Doris Tsao's Science paper, for instance-- Bao et al., 2020, at this point. It's changing the way we study and think about tuning, and it's reconciling debates about distributed and modular coding, because, as Nancy pointed out, the model does that by learning partially separable routes. It predicts lesion data.
ILA FIETE: Did Doris's experiment-- was it actually motivated by deep network models? Or I thought that she was just building linear decoders, right?
TALIA KONKLE: No, she was not. No, no, no.
ILA FIETE: Not by deep learning models.
TALIA KONKLE: Oh, I'm thinking about the Bao et al. 2020 paper, which basically says: if you take the late stage of an AlexNet model, look at how all of visual information is represented in it, and look at the first two principal components, there are major distinctions in the feature space-- faces stick out, and stubby things.
And then she went and scanned those. And sure enough, that feature space is in high-level visual cortex, which led her to discover bits of the feature space that had been unmapped-- guided by the DNN feature space, she could then map them and predict their relationship to other things. So really deep insights, I think, into the format and organization of high-level visual cortex.
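A minimal sketch of that kind of analysis, assuming torchvision's AlexNet and scikit-learn; the random image tensor is a placeholder for a real set of natural images:

```python
import torch
import torchvision
from sklearn.decomposition import PCA

# Late-stage features of an ImageNet-trained AlexNet for a set of images.
model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
images = torch.randn(500, 3, 224, 224)  # placeholder for natural images

with torch.no_grad():
    feats = model.features(images)
    feats = model.avgpool(feats).flatten(1)
    feats = model.classifier[:5](feats)  # penultimate ("late-stage") layer

# First two principal components of this object space: in analyses like
# Bao et al. (2020), categories such as faces vs. stubby objects
# separate along these axes.
pcs = PCA(n_components=2).fit_transform(feats.numpy())
```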
JAMES DICARLO: So Gabriel, did you want to jump in? And then I'll try to summarize.
GABRIEL KREIMAN: So I think this was probably obvious to you, Piedro, at the time. And I think it was clearly obvious to Tommy, and also to Fukushima and others, that you can actually build these kinds of systems. It was not obvious to the vision community in neuroscience at all. I think there has been a transformation over the last decade, where lots of people who never thought about models are all of a sudden starting to think about problems in quantitative ways.
If you look at papers in vision from the 1990s, it's all mixtures of words trying to combine ideas. Very few people, I think, really embraced HMAX or the neocognitron or other models as genuinely plausible ideas. So I think that has led to a major transformation. It's now fair to say that in the vision community, almost everybody is using models, at least to try things, to test, to do experiments, and so on.
One particular thing that we've done recently, which I thought was fun, was to modify the diet on which we train the models and use that as a causal intervention, to think about and make predictions about problems of visual search and asymmetries in visual search. For example, instead of having light coming from above, you can ask: what if we lived in a world where light comes from the side? And it turns out that leads to quite profound differences in how we would process images.
And we can go and test some of those hypotheses, at least at the behavioral level, potentially even at the neural level. So I think the idea-- and Jim alluded to this virtuous cycle-- that we can build models, do experiments in silico, then go test and revise them: this was not there a decade ago. I think it's a great achievement to really get people to think quantitatively about these problems.
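A minimal sketch of one crude way such a "diet" intervention could be set up; the fixed-rotation transform is a hypothetical stand-in for properly re-rendered side-lit stimuli:

```python
import torchvision.transforms as T

# "Light from above" is a strong regularity of natural images. A crude
# way to simulate a world where light comes from the side is to rotate
# every training image 90 degrees, so shading gradients run sideways.
side_lit_diet = T.Compose([
    T.RandomRotation(degrees=(90, 90)),  # fixed 90-degree rotation
    T.ToTensor(),
])

# Train one model on the normal diet and one on the rotated diet, then
# compare visual-search asymmetries (e.g., for convex vs. concave
# shaded targets) between the two models -- the "causal intervention."
```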
JAMES DICARLO: OK. We're going to move to the next question; I just want to summarize. I was already drinking the Kool-Aid, as you said-- I probably came here to work because Tommy was working on HMAX, thinking these models were going to succeed at some point. But you're right, the field shifted. And I'll add one thing: the models are driving new experiments in neuroscience that we wouldn't have done, even in our own lab.
The models predict things that we did not know to look for, and the neurons respond to them-- and we still don't understand those things. That's in the spirit of what Talia said about the experiments Doris is doing. So in that sense, it's changing the experimental landscape of what gets done. It still doesn't mean we're going to converge on an understanding, but, to your question, it certainly changes the way the business of neuroscience is being done.
I think there's a wake-up call among neuroscientists: what form of understanding were we actually hoping for? Which is essentially the discussion we're having here. And there's a bit of a crisis there about why people get into the field. That's my channeling of where things are. But I think it is more exciting than worrisome, and it's being embraced in lots of ways.
Let's go to the last five minutes-- who had their hand up first? I saw Sam in the back. I think you might have been next. I'm sorry, I don't know your name.
AUDIENCE: Thank you. [INAUDIBLE]-- another student of Tommy's, from the generation of the '90s. A simple, practical question. We've seen the evolution of ANNs over the years, with different kinds of connections, architectures, different kinds of units-- ReLUs, sigmoids, whatever-- and transformers now. If you were to pick one architectural design change to make next, in vision or language or motor or whatever you pick in the space, because everything is different, of course-- to go to the post-transformer era, what would be that one architectural change, based on knowledge from neuroscience? And the bonus question: why?
ILA FIETE: I guess one is a tough one; I would name two or three. I think lateral connectivity-- re-inserting recurrent connectivity for dynamics. Feedforward networks have, as we've been hearing, been very successful in modeling the visual stream. And I think that to some extent the big depth of those models is an unfolding in time of recurrent dynamics, but that's also why they're so data-hungry: suddenly they have so much flexibility, because each layer of weights is separately trainable.
So if we put in recurrence, those become more constrained versions of these very, very deep networks, and it allows for temporal computations. And if I get another couple, I would echo Christof and say we need cell types with distinctive lateral connectivity, and then we would want top-down connections. Those three, I think, would really get us a lot. They would allow us to build very, very rich models.
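A minimal sketch of the point about recurrence, in PyTorch: unrolling a recurrent layer over time yields a very deep network whose layers all share one set of weights, a far more constrained object than an equally deep feedforward stack:

```python
import torch
import torch.nn as nn

class ConvRNNLayer(nn.Module):
    """Unrolled over T steps, this behaves like a T-layer deep network
    in which every layer reuses the same weights, so it has far fewer
    free parameters than a feedforward stack of the same depth."""
    def __init__(self, channels):
        super().__init__()
        self.feedforward = nn.Conv2d(channels, channels, 3, padding=1)
        self.lateral = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, steps=5):
        state = torch.zeros_like(x)
        for _ in range(steps):
            # The same lateral weights are reused at every time step.
            state = torch.relu(self.feedforward(x) + self.lateral(state))
        return state
```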
JAMES DICARLO: Anybody else want to add to this? OK.
TALIA KONKLE: So again, I frame it in terms of what's the capacity and what's the mechanism you want to test. So: top-down attention, and the ability of a system to right itself given more time. We're building feedback in, and that feedback is steerable through language models. So you can say "look for a cat," and it can modulate your system to then direct attention. So it's, again, a mechanism in the context of a capacity.
AUDIENCE: [INAUDIBLE].
TALIA KONKLE: That's true. But those typically don't have a late stage-- those are typically local recurrence. So I'm positing a mechanism in the context of a capacity, which I think is a useful format for the kind of understanding we're trying to seek. Another one that I would want--
CHRISTOF KOCH: So it's top-down knowledge from language doing the guiding?
TALIA KONKLE: So something like the late stage of an AlexNet is kind of a useful interface: it's got access to faces here, scenes there. So if you can make language point to that, and then use that understanding to go back, you can guide or shape or direct the encoding. That's the kind of thing we're thinking about-- working from a phenomenon that we know is important and that we think is interesting for how you learn from the world. You don't just look at something; you look at it with a goal in mind, or with attention.
That's going to shape the way you want to learn things. So that would be one example of something I think is important. But also: why? In fact, I think you should go why first and then what.
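A minimal sketch of language-steered feedback of this flavor, assuming Hugging Face's CLIP text encoder; mapping the text embedding to channel gains through a learned linear layer is a hypothetical wiring choice, not the panelists' actual model:

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

# The text encoder provides the top-down goal ("look for a cat").
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

class LanguageSteeredGain(nn.Module):
    """Maps a text embedding to per-channel gains on a visual feature
    map -- a hypothetical wiring of language-guided feature attention."""
    def __init__(self, text_dim=512, num_channels=256):
        super().__init__()
        self.to_gain = nn.Linear(text_dim, num_channels)

    def forward(self, feature_map, text_embedding):
        gain = torch.sigmoid(self.to_gain(text_embedding)) * 2  # in (0, 2)
        return feature_map * gain.view(1, -1, 1, 1)

tokens = tokenizer(["look for a cat"], return_tensors="pt")
text_emb = text_encoder(**tokens).pooler_output   # (1, 512)
steer = LanguageSteeredGain()
features = torch.randn(1, 256, 14, 14)            # placeholder features
modulated = steer(features, text_emb)
```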
JAMES DICARLO: Nancy.
NANCY KANWISHER: I mean, something I would love someone else to develop-- I can't do this-- is more biological learning methods, because it would be really lovely to be able to model development. And it's so tempting: you train a network and you're so tempted to ask, well, what do the units look like early on? What do they look like later? How does that change over time? And then you're like, oh, this is stupid-- we're doing backprop. It has nothing to do with brains.
But it'd be really nice, given all the power we have, as Talia mentioned, with controlled rearing in models-- wouldn't it be great if the learning were a little bit more biological?
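A minimal sketch of one family of more biological learning rules: a purely local Hebbian update (Oja's rule), where each weight changes using only its own pre- and postsynaptic activity, with no backpropagated error:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 100, 10
W = rng.normal(0, 0.1, size=(n_out, n_in))

def oja_step(W, x, lr=0.01):
    """Local Hebbian update (Oja's rule): weight change depends only on
    the pre-synaptic input x and the post-synaptic output y -- there is
    no global error signal propagated backward through layers."""
    y = W @ x
    return W + lr * (np.outer(y, x) - (y ** 2)[:, None] * W)

# "Rearing" the weights on a stream of inputs, one sample at a time.
for _ in range(1000):
    x = rng.normal(size=n_in)
    W = oja_step(W, x)
```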
JAMES DICARLO: I think some of the most exciting work that's going on in that space is represented by people on this panel.
TALIA KONKLE: Working on it.
JAMES DICARLO: You're all baking the cake.
TALIA KONKLE: Yeah, exactly.
JAMES DICARLO: OK.
GABRIEL KREIMAN: It's about genetics, it's about development, and it's about the constraints of the physical world. Basically, what's really cool here is that somehow evolution arrived at a solution that we have now implemented-- and not only did evolution have a path toward it, but developmentally it's actually feasible. So I want us all to reflect a little bit on: is it possible that that's the only solution?
Why are our computational models so close to this evolutionarily achieved and developmentally instantiable solution? And also, just the concept of: what is the space of all possible solutions? And the generality, perhaps, with which the brain actually develops-- the fact that there are some primitives and there's wiring, which is influenced by the sensory inputs even during development, and of course by post-birth input of language and all kinds of other stuff.
And basically, what are the primitives with which we are able to achieve these systems? How could we perhaps be completely unconstrained by this and achieve more intelligent models? And could some of the cell types, for example, just be implementation bugs or quirks due to the constraints that evolution and development place?
And also, there's just the physical instantiation within our skull, and maybe we shouldn't worry so much about modeling that. Anyway, just some provocative--
JAMES DICARLO: OK, that was about five provocative questions. In the spirit of discussion, let's take that offline-- there's a coffee break now, so you guys can step out. I want to thank the panelists for all their time, and also for--
[APPLAUSE]