Building newborn minds in virtual worlds
Date Posted:
April 28, 2015
Speaker(s):
Prof. Justin Wood, USC
Brains, Minds and Machines Seminar Series
Description:
Abstract: What are the origins of high-level vision: Is this ability hardwired by genes or learned during development? Although researchers have been wrestling with this question for over a century, progress has been hampered by two major limitations: (1) most newborn animals cannot be raised in controlled environments from birth, and (2) most newborn animals cannot be observed and tested for long periods of time. Thus, it has generally not been possible to characterize how specific visual inputs relate to specific cognitive outputs in the newborn brain.
To overcome these two limitations, I recently developed an automated, high-throughput controlled-rearing technique. This technique can be used to measure all of a newborn animal’s behavior (9 samples/second, 24 hours/day, 7 days/week) within strictly controlled virtual environments. In this talk, I will describe a series of controlled-rearing experiments that reveal how one high-level visual ability—invariant object recognition—emerges in the newborn brain. Further, I will show how these controlled-rearing data can be linked to models of visual cortex for characterizing the computations underlying newborn vision. More generally, I will argue that controlled rearing can serve as a critical tool for testing between different theories and models, both for developmental psychology and computational neuroscience.
ELIZABETH SPELKE: Today's talk is by Justin Wood, who's in the psych department at USC. Justin studies the nature and development of representations of objects and faces and bodies and bodily actions and number. And from your website, I learned also space.
JUSTIN WOOD: Yes.
ELIZABETH SPELKE: Places in the navigable layout.
JUSTIN WOOD: Exactly.
ELIZABETH SPELKE: I didn't even know about that one. He's been doing these studies for a number of years. Most of his published work on this topic comes from experiments that he's done using traditional behavioral methods, either on college students, or [INAUDIBLE] people, maybe. Or--
[LAUGHTER]
Yeah. Or on adult free-ranging monkeys living on gorgeous islands and getting curious about objects and faces and like that. Or on human infants. But when he moved to USC, his work went in two really strikingly new directions. First of all, he started doing studies of chicks, who turned out to be an interesting model system for a number of different aspects of human object, numerical, and social cognition. Hopefully, not for all of human object, numerical, or social cognition. But interesting pieces of it seem to be captured by chick models.
But second, so far, all of the work, just about all of the work that's been done on chicks-- like the work that's been done on monkeys and human infants-- has used methods where you parachute in and study them for a little while, small samples through behavioral experiments with human observers. And what Justin has been pioneering is a different way of collecting experimental data, but also, and I think even more exciting, a way of totally controlling the experience of the animal at every moment of its existence, from the instant that it hatches until the instant that you test it.
And doing so, in a system that's got no humans in it, that's fully automated, and that allows you to control really precisely both the experience that an individual animal is getting, and also to test systematically how differences in the experience given to different groups of animals can influence later development. So I hope lots of you will be excited about this. If you are, you should know that Justin is around all week. He's going to be giving a talk at Harvard on Thursday at noon in William James Hall in the seventh floor seminar room on some of these topics.
JUSTIN WOOD: Some of these topics, yes.
ELIZABETH SPELKE: On some of these topics. And he's also available to meet with people either at MIT tomorrow, or well, really, at any point, I guess, [INAUDIBLE] he's at MIT. [INAUDIBLE] at Harvard. So if you get interested and want to hear more about what he's doing, you can grab him or one of us afterwards. But anyway, without further ado, Justin.
[APPLAUSE]
JUSTIN WOOD: Hello. Thank you so much. It's such a pleasure to be here. I want to start by thanking Elizabeth Spelke and Josh Tenenbaum for inviting me here and arranging my visit. And I also want to note that many of the people in this room and at MIT more generally have had a profound impact on how we think about the mind and the brain. And I hope that you'll see that many of your ideas and your inspirations have heavily inspired the direction of the controlled-rearing research that I'm going to tell you about today.
So in general, my lab studies how high-level vision emerges in the newborn brain. And specifically, we study the very first object representation built by a newborn visual system. So to do so, we use a controlled-rearing approach. We raise newborn chickens and strictly control their virtual worlds. And we manipulate those worlds in a variety of ways to examine how specific visual inputs shape the development of high-level vision. And across a range of experiments, we're starting to build a detailed input-output map characterizing how specific visual inputs relate to specific behavioral outputs in a newborn animal.
I think this approach is allowing us to tackle a range of questions. For example, how quickly does high-level vision emerge in the newborn brain? What types of experiences are needed to build high-level visual abilities? What is the nature of the mechanisms that underlie high-level vision? And how do these mechanisms, how does experience shape these mechanisms over time?
So to give you a sense, a quick outline, of what to expect today, I'm going to start by telling you about a new automated controlled-rearing method that my lab has developed over the last six years. This method allows us to study how high-level vision emerges in the newborn brain with incredibly high precision. Second, I'm going to tell you about research using this method that shows that one high-level visual ability, invariant object recognition, emerges rapidly in newborn visual systems. And finally, I'm going to tell you about research examining the origins, the developmental origins, of invariant object recognition, asking whether this is a hardwired property of vision, or whether it's learned from experience with natural visual worlds. And together, I think this approach provides a way to tackle some of the deepest and most stubborn questions in cognitive science, by revealing how abstract representations, and abstract thought more generally, emerge in the newborn brain.
So in general, understanding the origins of high-level vision requires answering two general questions. So first, what is the initial state of visual processing machinery at the onset of vision? When a newborn first opens their eyes, what is the nature of the machinery that they use to begin building enduring object representations from high-dimensional sensory inputs? And second, how does experience shape this machinery over time? What are the computations by which the visual system changes as a result of specific visual experiences?
So what counts as understanding an ability? For my lab, and for many labs here at MIT, a deep understanding of a phenomenon requires being able to build the phenomenon in an artificial visual system. So one of my primary goals, one of the primary goals of my lab, is to understand the emergence of high-level vision at a level that allows us to build these visual representations in an artificial computer model. To build a computer model, we need to explicitly characterize the architecture of the system and the computations that shape this system over time. And this idea, of course, was nicely captured by Richard Feynman, with his beautiful quote, "What I cannot create, I do not understand." And we use this idea as a constant guiding force in our lab when we design and conduct our experiments.
Specifically, we're experimenting with deep learning neural networks. So we take a set of simple visual inputs. So here, you can see a single virtual object rocking back and forth through a 60 degree viewpoint range. And we give this input to both newborn animals and deep neural nets. We then measure the nature of the representations created by both the newborn animals and the deep learning neural nets to examine whether the computational models and the newborn animals create the same types of visual representations as one another.
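To make that comparison concrete, here is a minimal sketch of one way such an animal-to-model comparison could be set up. It is not the lab's actual pipeline: the untrained convolutional encoder, the random stand-in "frames", and the cosine-similarity score are all illustrative assumptions.

```python
# Sketch: compare how a simple convolutional encoder represents familiar vs.
# novel viewpoints of an imprinted object. The untrained network, the random
# stand-in "frames", and the cosine-similarity score are illustrative
# assumptions, not the lab's actual modeling pipeline.
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # stand-in newborn "visual system"
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
)

def embed(frames):
    """frames: (N, 3, H, W) tensor of rendered viewpoints -> (N, D) features."""
    with torch.no_grad():
        return encoder(frames)

# Hypothetical stimuli: the imprinted object's 60-degree training range, novel
# viewpoints of the same object, and the unfamiliar test object.
train_views  = torch.rand(12, 3, 128, 128)
novel_views  = torch.rand(12, 3, 128, 128)
other_object = torch.rand(12, 3, 128, 128)

prototype = embed(train_views).mean(dim=0)    # summary of the imprinting input

def mean_similarity(views):
    feats = embed(views)
    return torch.cosine_similarity(feats, prototype.unsqueeze(0)).mean().item()

# If the model generalizes the way the chicks do, novel viewpoints of the
# imprinted object should score closer to the prototype than the unfamiliar
# object does.
print("novel views of imprinted object:", mean_similarity(novel_views))
print("unfamiliar object:              ", mean_similarity(other_object))
```

The same logic extends to trained deep networks: whichever model reproduces the chick-like pattern of generalization is a better candidate for the computations at work in the newborn visual system.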
In today's talk, I'm going to focus on the controlled-rearing studies that we've performed, which are beginning to constrain the space of computational models that can account for the emergence of high-level vision in a newborn visual system. As I'm sure all of you know, the space of models in the computational science literature is huge. So one of our first goals was to begin to constrain the space of models that could account for the development of object representation and other high-level visual abilities in newborn chickens.
So for example, do we need a computational model that operates over individual images, or sequences of images? Do we need to build in any hardwired features, or are these features learned from experience with natural visual worlds? By performing controlled-rearing experiments, we hope to narrow down the possible space of computational models that can successfully account for the emergence of high-level vision.
And I think one of the major benefits of this approach is it moves the study of newborn cognition beyond verbal theories. So rather than just saying whether a newborn does or does not have a particular ability, we want to develop a quantitative understanding of how the newborn visual system changes as a result of specific visual experiences. Computational models make explicit [INAUDIBLE] predictions that we can use to assess our understanding of the emergence of high-level vision.
So before I jump into the details, I also want to note that this research endeavour is critical for understanding and being able to build machines that learn about the world in the same way as biological organisms. So understanding how the newborn visual system learns to perceive and understand the world is a necessary step for building artificial visual systems that perceive and understand the world in the same way that we do. For example, how much information, how much knowledge, do we need to build into an artificial computer system? To what extent should the system be self-organizing? What types of visual inputs does a system use to shape its various connection weights? And how do different visual inputs compete with one another for limited processing resources?
So why don't we already know how the newborn brain learns to perceive and understand the world in high detail? What's the holdup? In my view, the main problem is we lack detailed input-output patterns for high-level vision. So put simply, we do not yet know how specific visual inputs relate to specific behavioral outputs in a newborn visual system. And without these detailed input-output mappings, we can't test various computations to examine whether those computations can produce those input-output mappings. And without those computations, we can't build artificial visual systems that can learn to perceive and understand the world in the same way that we do.
Now, I do want to note some wonderful attempts to study the emergence of high-level vision. So for example, there's the beautiful work of Professor Sinha here at MIT examining how blind people who recover sight learn how to see. Similarly, there are wonderful studies with newborn babies, examining how infants, days or weeks after birth, begin to segment the world into coherent units. And these studies provide great insights into how the brain begins to understand the world.
However, they don't provide the detailed input-output patterns. For example, once blind individuals learn to see, they immediately confront this rich visual world filled with hundreds to thousands of objects seen in many different contexts. And even a newborn baby just a few days after birth has still seen dozens to hundreds of objects from various viewpoints, under different lighting conditions, and so forth. So it's always possible that any kind of early emerging abilities we see, however early in life, are still learned from natural visual experience.
So more specifically, I think there are two major limitations on our ability to obtain detailed input-output patterns on high-level vision. So first, most newborn animals can't be raised in strictly controlled environments. And given the work by, for example, Professor DiCarlo here at MIT, showing that even adult visual systems can be shaped in as little as one to two hours of altered visual experience, this is a major problem. Even one to two hours of uncontrolled visual experience may be sufficient to shape a visual system. So this means we don't have direct control of the inputs to the visual system.
The second major problem is that with traditional methods, we can only obtain a few data points from each newborn. And the problem with this is it prevents us from measuring the outputs of a visual system with high precision. Ideally, we need a method that allows us to measure each newborn's first visual object representation with high precision.
So in the first part of my talk, I want to describe an automated controlled-rearing method that my lab has developed over the last several years. And this method allows us to overcome those two limitations. This method allows us to precisely measure both the inputs going into the visual system and the outputs that emerge from the system. And because this controlled-rearing method is automated, it allows us to run many, many experiments. We've run hundreds of experiments so far, obtaining many different input-output mappings across a wide range of different contexts and visual worlds. So we can begin to search for the computations that can explain those input-output mappings.
So the first step in this research program was finding a good animal model. And animal models, of course, provide a critical tool in the investigation of the mind and brain. And there are many notable examples of how the field has used simple animal models to understand complex phenomena. I'd like to suggest that chickens are an ideal and unique animal model for studying the development of high-level vision, really for three major reasons.
So first, newborn chickens can be raised in strictly controlled environments immediately after hatching. You can hatch a chick in complete darkness. You can move it to a controlled-rearing chamber that contains no objects or caregivers. And that chick will actually learn to find food and water on its own. So critically, unlike newborn primates, unlike newborn rodents, unlike newborn pigeons-- all good animal models that have been used to study adult invariant object recognition-- there's no need for a caregiver with a newborn chicken. So it's possible to raise a newborn chicken in strictly controlled environments with no real-world objects for their entire life.
Second, newborn chickens imprint to objects. And this phenomenon has been well studied over the last 100 years, on both behavioral and neurophysiological levels. And in brief, chickens develop a strong social attachment to the first objects they see in life, over the first two to three days of life. And they'll tend to spend the majority of their time with those objects. Well, this is really useful for us, because this means we don't have to train the subject. We can simply put the imprinted object on one side of a screen, for example, in a controlled-rearing chamber, and if they imprinted to the object and they recognize the object, they should spend the majority of their time with that object. So no training needed.
And third, avian and primate brains have homologous cortical-like cells and circuits. There have been a number of recent high-profile papers showing common structural and functional properties between mammalian and avian cortical processing. So for example, if you look at a wiring diagram of the mammalian six layer cortical circuit, you'll find that the avian brain contains cortical circuits with a nearly identical wiring pattern.
Further, a few months ago, a paper just came out showing that adjacent and connected regions of the avian cortex form a hierarchy of information processing just like the mammalian cortex, and that neurons in the mammalian cortex and the avian cortex exhibit comparable single-cell and population-level coding strategies across cortical regions and layers. So together, these results suggest that the avian and the mammalian cortical circuits evolved from a common ancestor of mammals and birds. And importantly for the present purposes, it suggests that controlled-rearing studies of chicks can be used to inform our understanding of the development of high-level vision in mammals, including primates and potentially humans.
AUDIENCE: Can you say a little bit more like how close to the [INAUDIBLE] these are? Both Nancy and I were just thinking huh, that they weren't that similar. And you mentioned some really recent studies.
JUSTIN WOOD: Yeah. So most of the work comes from Harvey Karten, who proposed this hypothesis back in the 1960s. And so far, most of the work in his lab has been looking at the wiring patterns between the avian brain and the mammalian brain.
AUDIENCE: Like what level, like [INAUDIBLE] type wiring diagrams, or what kind of level of wiring?
JUSTIN WOOD: So looking at the inputs to various levels of the six-layer cortex, so what's projecting to what, where are the long-range connections in the six-layer cortex, where are the short-range connections. And if you actually-- at the end of my talk, I can pull up a wiring diagram showing them both side by side, and it'll show that-- there's a couple of notable differences, but in general, they seem to be pretty comparable to one another.
Now, it is worth noting that from a macro-architecture perspective, the avian brain and the mammalian brain look really different. So the avian brain is organized in a series of nuclei, whereas the mammalian brain is organized in a series of layers. So from a macro-architecture perspective, they look very different. Also, the chicken brain is very small compared to the human brain, so you're going to have far fewer levels of processing. And given many studies, both with brains and with computers, showing that the more layers that you add to a system, the more abstract representations you get, I've no doubt that humans are going to be able to build far more abstract representations than, for example, a chicken.
What I think is exciting is that given this comparison between the cortical circuits, it suggests that this might be a developmental model that we can use to begin to study at least how the cortical circuit works. And if we understand how the development of the cortical circuit works in the chick, then perhaps all we would need to do is take that and sort of amp it up, so put in way more layers, add way more power than you would ever get in a chicken brain, and then just see where that can take us. Can that explain some of the amazing things that we end up seeing in mammalian cognition, and also, of course, in human cognition? But again, ask that question again at the end, and maybe I'll fill that in.
OK. So our controlled-rearing method involves incubating eggs in complete darkness, moving the chicks to controlled-rearing chambers, again, in complete darkness, and then turning on the chambers to reveal a preprogrammed visual world. These chambers run 24 hours per day, seven days per week, and continuously record all of the chicks through micro cameras that are embedded in the ceiling of each chamber.
So to give you a general sense of my lab, I want to take you on a short virtual tour. Our lab contains dozens of automated controlled-rearing chambers. These chambers contain no bounded movable objects. And all of the chickens' visual experiences come from preprogrammed virtual object animations projected onto LCD monitors in the cage. So as we fly through, each one of these boxes is a controlled-rearing chamber.
And to give you a sense of what this looks like from a chick's perspective, we can open up one of these chambers, and we can fly inside. And we close the door, to make this as comparable as possible to what the newborn chickens experience, and we can rotate around the chamber to give you a sense of what this is like from the chicken's point of view. Now, as you can see, this is a very limited virtual world. The main stimuli come from these virtual objects projected onto the display walls. Again, there are no other objects. There are no caregivers. There are no movable bounded objects. All of the visual object input the subject receives throughout their entire life comes from these virtual objects on the display walls.
I want to emphasize that we take extreme pains to control all of the subject's visual experiences. So we refill the food and water containers in complete darkness to avoid exposing the subjects to any extraneous visual experiences. And to do this, we use night vision goggles. So here's a pair of the night vision goggles that we use to do all of this work in complete darkness.
This animation gives a different perspective of the chambers. So on the left and the right, you can see the LCD monitors. And you can see that a different object animation is being projected onto each of the monitors. In the front of the cage, you can see the food and water containers. And it may be a little bit hard to see from this perspective, but they're recessed into the ground as if they were wells. And we did this because we didn't want the food and water to be in object-like containers that were sticking out of the ground. We wanted them to be below floor level, again, to be like wells in the ground, as opposed to objects.
And for food, we used fine grain. We have to feed the chicken something, so we use fine grain. And the critical thing about grain is that it doesn't behave like a rigid bounded object. It doesn't maintain a rigid bounded shape. So this is the best that we could do in terms of having to feed the chickens, but also making sure that we weren't presenting them with any kind of visual object input.
And another important feature of these chambers is that we don't need to move the chickens to a different testing apparatus to test their cognitive abilities. They spend their entire lives in these chambers. And we can just project a different object animation onto each one of these display walls and examine which animation they spend the most time with. And if they can recognize their imprinted object across, let's say, changes in viewpoint, or changes in illumination, or changes in size, they should spend more time with the animation of the object they imprinted to versus an unfamiliar object.
And it's worth noting that these chickens are completely convinced by the virtual objects. So here is one actual chicken by one of the virtual objects. And they end up spending overall about 80% to 90% of their time with these objects-- and this includes the time that they're sleeping. So typically, what they'll do is they'll just snuggle right up next to the monitor. Oftentimes, they'll fall asleep next to the monitor. So these chickens are completely convinced by this visual object experience.
AUDIENCE: Is there any temperature gradient effect in there in terms of the environment?
JUSTIN WOOD: Not that we can tell. So we've tried taking a fine temperature probe and putting it next to it, and we can't tell any difference that way. We also haven't noticed many changes over differences in the overall temperature of the chamber more generally.
So this slide shows actual video footage from a set of experiments we performed. So this video footage shows 46 controlled-rearing chambers, recorded from a bird's-eye perspective. And at any given moment in time, we know the stimuli that were projected onto the left and the right display walls. And we use automated animal tracking software to measure the position of the chickens at a rate of nine samples per second, 24 hours per day, seven days per week, which results in a massive amount of data from every set of experiments that we end up performing.
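As a rough illustration of what that kind of automated tracking involves, here is a minimal sketch, assuming a light-colored chick against the dark chamber floor; the threshold, frame size, and example data are assumptions, not the lab's actual software.

```python
# Sketch: find a bright chick against the dark floor in one overhead frame by
# thresholding brightness and taking the centroid of the bright pixels.
# Sampled ~9 times per second, this gives the chick's position trace.
# The threshold and frame size are illustrative assumptions.
import numpy as np

BRIGHTNESS_THRESHOLD = 120   # assumed cutoff separating chick from black floor

def track_chick(frame):
    """frame: (H, W) grayscale image. Returns (row, col) centroid or None."""
    mask = frame > BRIGHTNESS_THRESHOLD
    if not mask.any():
        return None                          # chick not visible in this frame
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

# Example: a synthetic 240x320 frame with a bright blob standing in for the chick.
frame = np.zeros((240, 320), dtype=np.uint8)
frame[100:120, 200:220] = 200
print(track_chick(frame))                    # approximately (109.5, 209.5)
```

Binning each position sample by which display wall it is nearest to is then enough to turn this trace into the time-with-object measures used in the test trials.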
And critically, we can perform lots of different experiments at once. So rather than having to guess what would be the next best experiment that we want to do to test between hypotheses, we can actually run eight experiments at a time to do a bit more of a high throughput approach on how specific visual inputs end up shaping the newborn animal brain.
And perhaps needless to say, automating the data collection process is extremely useful. So imagine manually coding all of this behavior. It would be a total nightmare. So over the course of a set of experiments, which typically lasts about two weeks, we end up collecting about 15,000 hours of video footage. Yeah?
AUDIENCE: Did you say what does the camera look like from the inside?
JUSTIN WOOD: So the camera is about-- it's about that big. And it's 26 inches off the ground. And it just sits in a tiny little hole up in the ceiling. So that is another tiny feature that they do end up seeing in their chamber.
I also didn't talk about the cage flooring. So the cage consists of wire mesh. This is another surface that they do have visual access to. And it's supported by thin transparent beams over another black surface. And we need to make the surface black-- this actually does a good job of illustrating this-- because the automated tracking software needs to have a big contrast between the color of the chick and the color of the background. So all of these are things that could shape the visual system, which is a really good point.
So what we're interested in is to what extent can we manipulate the object representation, given that all of these other features of the chamber are held constant. And as I'll show you when I start showing you the data, you'll see that we actually have incredible control over how much we can push around a representation simply by changing the visual input in different ways.
So while I have the opportunity, I just want to emphasize how useful automation can be in studying the development of the mind. So I think there are at least five major advantages to automation. So first of all, it allows us to collect an incredibly large amount of data. Again, over just a two-week experiment, with 46 controlled-rearing chambers all going at once, we can collect about 15,000 hours of behavioral footage. And this allows us to study each chick's first visual object representation with incredibly high precision.
Second, it maintains an objective record of events. So oftentimes, when you design studies, you have certain variables that you think would be a good idea to look at. But then once the study is done, you might decide that there are other variables that would have been useful to look at later on. Well, the nice thing about this is that we have a complete digital recording of every single subject's behavior, again, at nine samples per second throughout their entire life. So later on, we can go back and we can mine for variables that weren't initially considered when we initially ran the experiment.
Third, it's incredibly efficient. So when you're actually collecting data, there's no need for researchers. There's no need for coders. You don't need to worry about things like reliability coding, which is something that, as developmental psychologists, you have to worry quite a bit about. There's no possibility for human error. All of the coding is being done by a computer, an automated computer program. And critically, there's no possibility for experimenter bias either. All of the stimulus presentation is done by preprogrammed visual experiences. We pre-program all of their visual experiences before we do the experiments, and all of the data collection, again, is automated, so we don't need to worry about these fairly major problems in psychology, namely, whether humans are going to mess things up somehow, or whether there's going to be some sort of unconscious experimenter bias in the design.
So I think these are major reasons why automation, as much as we can use it within the study of the development of the mind, can be very useful, giving us a new tool for studying how the mind emerges from particular visual experiences.
I also want to emphasize how different this method is from traditional methods that have also used controlled rearing with newborn chicks. So with previous methods, typically what researchers would do is they would imprint a chicken over the first two to three days of life, and then they would pick that chicken up and they would move it to a separate testing apparatus, and they would measure its behavior for anywhere from six to 10 minutes. And so from that chicken, you end up getting a very small amount of data, just six to 10 minutes of behavior. So in contrast, we record all of the subject's experiences, nine samples per second, 24 hours per day, seven days per week. And this is useful, because again, it allows us to measure these subjects' early emerging visual abilities, early emerging cognitive abilities, with incredibly high precision. We collect thousands of times more data per subject than in traditional methods.
So we're currently using this automated method to explore a wide range of different abilities. So for example, we're studying object recognition, action recognition, scene recognition, number recognition, and face recognition. And in today's talk, I want to focus on the object recognition work, because we've made the most progress in this domain. And I think it also connects perhaps most closely to some research going on here at MIT.
OK. So in the second part of my talk, I want to tell you about research examining whether one high-level visual ability, invariant object recognition, can emerge rapidly in a newborn visual system. So I imagine most of you are familiar with invariant object recognition, but just to give a quick description, this term refers to the ability to recognize objects across variation in their appearance on the retina.
And to give a concrete example of this, on this slide, we can see nine different pictures. Each of these pictures casts a very different image on your retina. But yet nevertheless, we can all recognize almost immediately that these nine different images are all images of the same object. This ability is called invariant object recognition, because our ability to recognize objects is invariant to changes in viewpoint, changes in size, changes in illumination, and so forth.
So in one of our first experiments examining the origins of invariant object recognition, we raised chicks in virtual worlds that contained a single virtual object, this object, rocking back and forth through a limited 60 degree viewpoint range. And this was the entirety of their visual object experiences. Now, I want you to imagine being raised in such a simple visual world. What kind of representation could you build? Would you only be able to recognize the object when you saw it from these specific viewpoints, or would you be able to build a more abstract or invariant representation that would allow you to recognize the object across changes in illumination, across changes in viewpoint, and so forth? In other words, is the newborn visual system sufficiently powerful that it can take this limited and sparse input, 60 degrees of visual object experience, and build a representation that allows the outputs to exceed the inputs?
So to test this question, chicks were raised for one week in a world that contained the animation I just showed you. So just for simplicity, since it's a little bit warped here, I'll always put the animation up top to make it a little bit more clear. They were raised in this world for one week, and the animation switched walls every two hours. And since chicks imprint to objects, we expected chicks to imprint to this object, and to spend the majority of their time with this object when they were able to recognize it.
In the second week of life, we then examined whether chicks could recognize that object from novel viewpoints. So for example, on one side of this screen, up top, you can see the imprinted object, but it's presented from a different viewpoint. Here, we can see it's now presented from a front viewpoint range, whereas if I flip back to the previous slide, you can see it was initially presented from a side viewpoint range. So on the test trial, if they can recognize this object, even though it's now casting a very different image on the retina, they should spend more time on this side of the cage than on this side of the cage, which contains the unfamiliar object.
Now, one really nice thing about virtual objects is we can actually quantify the similarity between the objects. For example, we can measure the pixel-by-pixel brightness to approximate a retinal image representation, or we can measure the features of the object from a V1-level perspective using a bank of Gabor filters. Now, this is important, because in order to demonstrate that chicks used high-level visual representations to succeed in this task, we need to show that they couldn't have used simpler retina-like or V1-like representations.
So this also means we need a good unfamiliar object that will allow us to distinguish high-level visual representations from low-level representations. And you'll notice that when these two objects move, they tend to carve out similar spatiotemporal patterns over time. And to demonstrate this more clearly, let me just superimpose the objects. And again, you can see that as these objects move, they really seem to carve out very similar spatiotemporal patterns. And I want to say that we modeled these objects after objects that were initially created here at MIT, so thank you very much. These are wonderful objects to use in our initial investigation of the origins of invariant object recognition.
This graph shows you the results of quantifying the similarity between these two virtual objects. So this bar shows the pixel-level similarity between the unfamiliar object and the imprinting stimulus. So in other words, how similar was the unfamiliar object to what they were imprinted to. And these other bars show the pixel-level similarity between the novel viewpoints of the imprinting object and the imprinting stimulus. So critically, what you notice is that from a pixel-level perspective, the unfamiliar object is actually more similar to the imprinting stimulus than the novel viewpoints of the familiar object are to the imprinting stimulus.
So any kind of simple model that relied simply on retina-like representations or V1-like representations would not be able to succeed in this task. Rather, chicks presumably needed higher-level representations akin to those found in higher levels of the primate ventral stream. And it's worth noting that we find an almost identical pattern when we simulate V1-like representations using a bank of Gabor filters.
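To make those two controls concrete, here is a rough sketch of how the pixel-level and V1-like comparisons could be computed; the correlation metric, the filter parameters, and the random stand-in images are assumptions rather than the measures actually used in the study.

```python
# Sketch: quantify image similarity (1) at a retina-like level, as the
# correlation of raw pixel brightness, and (2) at a V1-like level, as the
# correlation of responses from a small bank of oriented Gabor filters.
# Filter parameters and the choice of correlation are illustrative assumptions.
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(theta, sigma=4.0, wavelength=8.0, size=21):
    """Oriented Gabor filter built directly in numpy."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / wavelength)

def pixel_similarity(a, b):
    """Correlation of raw pixel brightness between two grayscale images."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def v1_similarity(a, b, n_orientations=8):
    """Correlation of Gabor filter-bank responses between two images."""
    thetas = np.linspace(0, np.pi, n_orientations, endpoint=False)
    features = []
    for img in (a, b):
        responses = [np.abs(convolve(img.astype(float), gabor_kernel(t))) for t in thetas]
        features.append(np.concatenate([r.ravel() for r in responses]))
    return np.corrcoef(features[0], features[1])[0, 1]

# Hypothetical usage with rendered frames of the imprinting stimulus, a novel
# viewpoint of the imprinted object, and the unfamiliar object:
imprint, novel_view, unfamiliar = (np.random.rand(128, 128) for _ in range(3))
print(pixel_similarity(imprint, novel_view), pixel_similarity(imprint, unfamiliar))
print(v1_similarity(imprint, novel_view), v1_similarity(imprint, unfamiliar))
```

In the talk's terms, the key check is whether these low-level scores rank the unfamiliar object as more similar to the imprinting stimulus than the novel viewpoints are; if so, purely retina-like or V1-like accounts cannot explain the chicks' choices.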
OK. So here's the data from one of our experiments. On the left here, you see what the imprinting stimulus was. So this was the input to the system over the first week of life. And on the right here, you can see how well that stimulus generalized to novel outputs, so generalized to novel viewpoints. And as you can see, generalization was quite impressive. So there were 11 novel viewpoints. There was one familiar viewpoint. And in general, chicks were able to take that little bit of visual object input and create an invariant object representation that allowed them to recognize the object across all of these different viewpoints.
Here's the data from just a different imprinting stimulus. So rather than imprinting them to the side view of the object, here the object is being presented from a front viewpoint range. And again, you can see very impressive generalization performance. Across all of these different viewpoints, chicks were able to build a representation, an invariant representation, that was able to generalize across all of these different viewpoints.
AUDIENCE: Sorry, just some details. Each of those bars is a bunch of different chicks, right?
JUSTIN WOOD: Yes. So each one of these bars is the average performance across six chicks.
AUDIENCE: And it's always relative to the same distractor [INAUDIBLE].
AUDIENCE: Yeah, they have a choice of that.
JUSTIN WOOD: Yeah, so we did it two different ways. The first way we did it is we always kept the unfamiliar objects the same, to maximize the pixel level and V1 level similarity between the unfamiliar object and the imprinting stimulus. And just to make sure that chicks didn't just prefer the more novel object, we also matched the viewpoints, and we found a similar pattern, although performance was a little bit less strong. Yes?
AUDIENCE: Did you look more locally on each of these objects, do you see parts that are more invariant across these views?
JUSTIN WOOD: So do we see some sort of systematic pattern?
AUDIENCE: That may actually be not that great for parts of the object. Sorry, pixel dissimilarity. [INAUDIBLE] on that bottom sphere that seems to be pretty rotational.
JUSTIN WOOD: They might be. They might be. So one thing that we want to do in the future is use, for example, bubbles techniques to see what particular features of these objects chicks are actually using in order to recognize these objects. So one nice thing about our method is we can collect hundreds of trials from every subject, so it opens up these new tools-- or not new tools in vision science, but new tools for developmental psychology that we can use to begin to figure out what exactly are they using, which features are they picking up in order to distinguish between these objects.
We haven't done those studies yet, so I can't tell you exactly which particular features they're picking up on. But yeah, that's a good point. I think it's possible that they're building sub-features of objects that are smaller than the entire object. And one reliable sub-feature they may be picking up on, for example, is this particular feature. So that's a good point.
AUDIENCE: So are the error bars across chicks, or are they across objects?
JUSTIN WOOD: Across chicks.
AUDIENCE: So you can't control for that.
JUSTIN WOOD: So we can't control for what?
AUDIENCE: Like if it was across imprinted objects, then you would have a control for-- anyways.
JUSTIN WOOD: I can talk to you about that offline.
AUDIENCE: Did you try static methods [INAUDIBLE].
JUSTIN WOOD: So it actually took us a really long time to figure out how to get a chick to be motivated to approach a static image. We've recently figured it out, but I don't have any data to show you about that, because basically, if you just have a static image and you show it to a chicken, they'll be bored by it and won't spend much time by it. So the way we recently figured out how to do that is you just have to flash the image over and over again, and that seems to continue to engage the chicken. So right now, we're doing studies where we're presenting static images, and that gives us a lot more control about eventually being able to figure out what are those specific parts of the object that they're using in order to distinguish between the objects.
AUDIENCE: I guess I was thinking [INAUDIBLE] I think he mentioned that you cannot learn and [INAUDIBLE] because you don't have motion, essentially.
JUSTIN WOOD: Oh, have we imprinted them to static objects?
AUDIENCE: I guess, you're suggesting that the chicks actually learn the geometric shape instead of the image?
JUSTIN WOOD: So not quite. Why don't you wait until the end of the talk, because I think I'll present a lot of data that might get at these questions. So I might get to it.
AUDIENCE: Just maybe, what is percent correct? Like over a certain period of time, you measured how long the chicks spend time with the imprinted object versus like how, exactly?
JUSTIN WOOD: So we do it two different ways. One way is we just measure the proportion of time they spend with the familiar object versus the unfamiliar objects.
AUDIENCE: Over a week?
JUSTIN WOOD: Over the week. Another way that we do it, in order to get separate trials that we can use for statistics, is for every 20 minute test session, we measure the amount of time they spend with the familiar object versus the unfamiliar object. And if they spend more time with the familiar object, that counts as correct. If they spend more time with the unfamiliar object, that counts as incorrect. So that way, we then have hundreds of test trials that we can use to feed into various kinds of Bayesian and statistical analyses.
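As a small sketch of that scoring scheme, assuming per-sample zone labels that come out of the tracking step (the zone definitions and the synthetic example session below are assumptions, not the lab's actual analysis code):

```python
# Sketch: score one 20-minute test session as correct if the chick spent more
# time in the zone by the familiar (imprinted) object than by the unfamiliar
# one. The sampling rate and session length follow the numbers in the talk;
# the zone labels and example data are illustrative assumptions.
import numpy as np

SAMPLES_PER_SECOND = 9
SESSION_SECONDS = 20 * 60

def score_session(zone_labels):
    """zone_labels: one label per position sample: 'familiar', 'unfamiliar', or 'neither'."""
    labels = np.asarray(zone_labels)
    time_familiar = (labels == "familiar").sum() / SAMPLES_PER_SECOND
    time_unfamiliar = (labels == "unfamiliar").sum() / SAMPLES_PER_SECOND
    return {
        "time_familiar_s": time_familiar,
        "time_unfamiliar_s": time_unfamiliar,
        "correct": bool(time_familiar > time_unfamiliar),   # one binary trial outcome
    }

# Hypothetical session in which the chick spends most of its time near the
# imprinted object.
rng = np.random.default_rng(0)
session = rng.choice(["familiar", "unfamiliar", "neither"],
                     size=SAMPLES_PER_SECOND * SESSION_SECONDS,
                     p=[0.6, 0.2, 0.2])
print(score_session(session))
```

Aggregating many such binary trial outcomes per chick is what makes the hundreds-of-trials statistical analyses mentioned above possible.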
AUDIENCE: Do you reverse the sides of the cage in which you put the novel object, I assume?
AUDIENCE: Can we reverse the sides of the cage?
AUDIENCE: You just have the wall that you put the novel object, it's not always on the same wall. I assume that you randomly are--
JUSTIN WOOD: Oh, yeah, yeah, yeah.
AUDIENCE: You changed the mechanism.
JUSTIN WOOD: Yeah, we randomize the side.
AUDIENCE: So there's no spatial orientation factor.
JUSTIN WOOD: Exactly, exactly. Yeah, that's a good thing to point out. Absolutely.
AUDIENCE: Are you just going to go through the conditions in which you said there was this and when you controlled for novelty, are you going to go through those for us at some point?
JUSTIN WOOD: The conditions in which we control the novelty?
AUDIENCE: You said you controlled for novelty, to make sure, because in this case, they were choosing between this moving image and another image that was static.
JUSTIN WOOD: No, no. So when I showed-- let me just flip back, this will probably make it easier. So this was the unfamiliar object, and this object was always moving. So their choice was always between this object and, for example, this object. So they were always moving. Although, as I addressed in previous questions, we're now starting to look at static images, because that gives us more control. We don't have to worry about the confounds that are introduced by movement.
AUDIENCE: It was the same image. So you're rotating the image that they've been imprinted to, and there's another image which is staying the same image on all the trials?
JUSTIN WOOD: So in a previous question, I mentioned that we've done it both ways. So sometimes, we do it where we keep the unfamiliar object the same. Other times we match the viewpoints. So we make sure they're both being presented from the same viewpoint range to maximize the pixel level similarity between the two test objects. And the results generally look the same for both, although performance is a little less good when we match the viewpoint ranges.
OK. So this just shows you the data from another condition, except we imprinted them to a front viewpoint range. And again, you can see that generalization performance is quite good. So just from this 60 degrees of visual object input, they're able to build an invariant representation that can be recognized across a wide range of different viewpoints.
And just to show you another object, here is the-- we sometimes imprinted them also to the unfamiliar object. We counterbalanced which object they were imprinted to, of course. And we found that with this unfamiliar object, they were still able to build a representation that was able to generalize well to all of these-- to most of these different novel viewpoints of the object.
So in this type of study, it's always important to examine whether performance changed across the test phase. So we continued to present novel viewpoints across the one-week test phase. So what this graph shows is their performance in recognizing the imprinted object as a function of the number of times they saw the set of novel stimuli. And critically, you can see that performance for the very first time they saw the set of novel viewpoints was nearly identical to the 14th time they saw all of the novel viewpoints. So performance was stable and robust across the whole test phase. There didn't seem to be any kind of significant learning going on across the test phase.
And this pattern makes sense, because chicken imprinting has a critical period. So after about three to four days of life, this critical period ends, and the imprinted representation becomes, in essence, sort of locked down in the brain. And it doesn't end up changing that much as a result of other visual experiences. And this is really useful for us, because it means that once the critical period ends, we can actually present the subject with hundreds of test trials showing new visual experiences, and it won't change that imprinted representation. So we can study the imprinted representation with high precision. And across all of the experiments we performed, the data looked generally like this. I don't think we've found a single study yet where we've noticed significant changes across the test phase.
And I also want to show you that we can obtain precise measurements from each individual chicken. So the red bars show the number of correct trials, and the black bars show the number of incorrect trials. And notice that the y-axis spans hundreds of test trials, so 50 test trials, 100 test trials, 150 test trials, and so on. So critically, every single chicken that was imprinted to the side viewpoint range of this object, or the side viewpoint range of this object, was able to develop a robust representation of the object that could generalize quite well across these novel viewpoints of the object.
So interestingly, we also see a little bit of individual variation. So the subjects over here, for example, were able to build a little bit better of a representation than the subjects over here. So this opens up new questions that our lab is currently exploring, such as why are some chickens able to build better invariant representations than other chickens?
So in sum, this study shows that chickens are capable of invariant object recognition. When we raise chicks in worlds that contain just a single virtual object that can only be seen from a limited viewpoint range, they're able to generalize that input to many novel outputs. So in other words, the visual processing machinery, in at least the newborn chicken visual system, appears to be incredibly powerful. Again, the outputs exceed the inputs.
So researchers in vision science and computational neuroscience are often interested in things like background invariance as well. So can you recognize objects across changes in background? So we wondered whether chicks would be capable of the same thing. So we imprinted chicks to an object that just rotated on various scene backgrounds. They were imprinted to the object on eight different scenes from four different scene categories. And then what we did is we tested whether they could recognize that imprinted object across both familiar scenes and novel scenes.
And we also varied the viewpoints of the objects. So here you can see that the imprinted object is being presented from the same viewpoint range, so just rotating around a horizontal axis. Here you can see that we rotated the object, so now it's presenting views that are 30 degrees different from what they were imprinted to. And here you can see that the object was rotated another 30 degrees, so 60 degrees in total, so this object is presenting views that are 60 degrees different from what they were imprinted to. So in other words, we wondered whether chicks would be capable of both background- and viewpoint-invariant object recognition.
So here's an example of what one of the trials looked like. In this particular trial, you can see that the familiar object and the unfamiliar object are both presented on the same background, and the object was presented across 60-degree changes along the azimuth axis.
So here's what performance looked like. Chance performance was 50%, because there were just two objects, one on each side of the chamber. And as you can see, performance was quite good, whether the objects were presented on the same scenes that they were imprinted to, whether they were presented on different scenes but from the same scene categories-- so for example, if they were imprinted to scenes of beaches and mountains, and then they were tested on scenes of different beaches and mountains. That's what this particular bar illustrates.
And this bar shows chicks' performance when they were tested on different scenes from different categories. So for example, if they were imprinted to beaches and mountains, and they were tested on, for example, city landscapes. Again, we can see almost no difference in terms of how well chicks were able to recognize the object across these viewpoint changes.
And this shows the viewpoint change performance. Again, we can see almost no difference, whether the imprinted object was presented from zero degree changes in viewpoint, whether it was presented across 30 degree changes in viewpoint along the azimuth axis, or whether it's presented in 60 degree changes in the azimuth axis. And there was no interaction between the scenes and the viewpoints. Yes?
AUDIENCE: I don't know if you've tried this, does it change if the background is a movie? Because it sounds like they don't [INAUDIBLE].
JUSTIN WOOD: That's a really good idea. It's actually queued up in our next set of studies. So hopefully, we'll have the answer for you in another couple weeks.
AUDIENCE: I also have another curiosity. Do a blind chicken in the wild get imprinted? Do they follow their [INAUDIBLE]?
JUSTIN WOOD: Do blind chickens in the wild get imprinted?
AUDIENCE: Like does it have to be a visual thing? I don't know.
JUSTIN WOOD: So the imprinting circuit actually connects to association cortices in the chicken brain. So there is a little bit of evidence that chickens actually imprint to auditory information as well. So I wouldn't think a chicken, a blind chicken, would do very well in the wild, but if they did happen to stay alive, my guess is that they would imprint to familiar sounds. So you can actually use this-- one direction we hope to go in the future is to actually use this imprinting method also to explore how auditory information becomes learned in the brain. So do you form, for example, invariant representations of auditory categories or auditory objects the same way that we do of visual information. Yeah?
AUDIENCE: I have a related question to the first question. Which is, is there any indication that the background acting as a scene in any way? So for example, the objects are not in the same perspective as the scene, and is there any reason to believe that there's is a scene-relevant sort of perception that happens at that point anyway?
JUSTIN WOOD: I'm not sure I understand your question.
AUDIENCE: So the question is, so you're saying that they're recognizing these objects over different scenes. That presupposes that these are scene recognition activities that are happening, right? So is there reason to believe, number one, that they are recognizing these as scenes? And number two, if the object is not interacting with the geometric properties of the scene at all, just floating on top of them, is that working against an interpretation of the scene?
JUSTIN WOOD: Oh, yeah. So I use scene just in the way that we generally use scene, just a scene that we're slapping onto an animation. There's a whole range of complex questions about whether chicks end up perceiving something as a scene or not. And in the question period, I can tell you about some of the research we've done. It really depends on how the chicken moves through a scene, in terms of whether they form scene categories, and also how quickly they move through the scene, which you'll see in just a sec, relates very closely to what I'm about to show you.
OK. So far, I've been focusing on invariant object recognition. But another critical feature of object recognition is the ability to recognize objects rapidly. So for example, in this video, the objects are being shown at a rate of six objects per second. And even though this is quite fast, you're still able to recognize each and every one of the objects in these videos. So studies of adult monkeys and studies of adult humans show that those subjects are capable of both rapid and invariant object recognition. And we wondered, would newborn chickens be capable of the same thing?
So to test this, we raised chickens in a world in which an object rotated 360 degrees around a single axis. And again, chickens should have imprinted to this object. They spent one week within this visual world. And in the test phase, we then varied the speed of presentation of these virtual objects to examine whether they could recognize the object when it was presented either quickly or slowly. So in other words, we tried to do chronometry on the newborn chicken brain.
And each image was presented for 42 milliseconds, for 125 milliseconds, for 250 milliseconds, or for 750 milliseconds. And each image was followed by a one-quarter-second white screen. So just to give you a sense of what this looks like, here was the 42 millisecond condition. Here was the 125 millisecond condition. Here was the 250 millisecond condition. And here was the 750 millisecond condition. Oh, it actually-- well. So much slower.
So you'll notice that we varied the viewpoint of the object across the successive object images. And we did this for two main reasons. So first, we wanted to prevent the chickens from developing a gradually more precise retinal representation across the successive object images. So to rule this out, we made sure that the viewpoints differed significantly from one another. And we also wanted to prevent the chickens from perceiving apparent motion, if that's possible within a chicken brain-- we still don't know at this point. But nevertheless, we made sure that the viewpoints of the object differed significantly from one another, by at least 70 to 80 degrees.
We also presented these objects across 30 and 60 degree changes along the azimuth axis, to examine whether, like adult monkeys and adult humans, chicks are capable of recognizing objects not only rapidly, but also across novel viewpoints.
So here's what a test trial looked like in the 750 millisecond condition. And as you can see, both of the objects were presented at the same presentation rate on each test trial. In this particular test trial, the objects were being presented across 30-degree changes along the azimuth axis. And I also want to note that for this particular study, we also controlled for overall brightness, we controlled for the pixel-level similarity of the images, and we controlled for the V1-level similarity of the images. So recognition performance could not have been based on any of these lower-level features of the objects. Yes?
AUDIENCE: Do these synchronize like that? So that there wasn't like a novel thing over here, novel thing over here kind of ping ponging effect?
JUSTIN WOOD: They were synchronized, though we found that it really doesn't matter if they're synchronized or not synchronized.
So here's what the data looks like. Chance performance was 50%, again, because there were two objects. And as you can see for most of the presentation rates, performance was well above chance levels. And the data looked similar whether there was a 0 degree change in viewpoint, whether there was a 30 degree change in viewpoint, or whether there was a 60 degree change in viewpoint, suggesting that chickens can not only recognize objects across novel viewpoints, they can also do this quite rapidly. Within just, really, a fraction of a second.
So to summarize on part two, I've presented evidence-- oh?
AUDIENCE: Sorry. [INAUDIBLE]
JUSTIN WOOD: Yeah. Just much slower ones.
AUDIENCE: [INAUDIBLE] about 60% of those previously [INAUDIBLE].
AUDIENCE: It's not rotating during those four seconds, is it?
JUSTIN WOOD: So it's presenting those different-- oh no, so it's staying stationary within those four seconds, yeah. So it's a good question. One thing to keep in mind is that these are chickens that are just kind of behaving as they would. And when mom stays-- I often call them mom, because that's what they imprinted to. So when mom stays put for a long period of time and isn't sort of changing viewpoints quite rapidly, that might be a little bit more comforting of a stimulus to be with. That's my best guess. I'm just kind of guessing at this point why we end up getting a little bit better performance for the longer presentation rates. The thing that we really care about is that at least for some of the faster presentation rates, they're still able to recognize the objects.
AUDIENCE: [INAUDIBLE]
JUSTIN WOOD: Oh, I see. Yeah.
AUDIENCE: [INAUDIBLE]
JUSTIN WOOD: This is part of the trouble of testing static images: when chickens are imprinted to a continuously moving object, there's a lot of information in a continuously moving object that just isn't present in a static image. And I think that's probably what accounts for the slight drop in performance. Yeah?
AUDIENCE: So you talking about moms gets me kind of thinking about some of my uncertainties with all of this. Because I think of this-- kind of the natural way that I want to think of this is more like face recognition, because it's chicks that have their behavioral demands on them are to imprint onto a [INAUDIBLE] specific-- hopefully their mother. And that they can just follow that person around. And they really only need to [INAUDIBLE] imprint onto one person, onto one [INAUDIBLE].
So I guess I'm expecting to see a different object, and if I see something like-- they can recognize more than one object. Whereas here, it's like, OK, they can recognize what I'm kind of reading as a face, because behaviorally, that's not what the demand is. Because they don't want to imprint onto a rock, right? They want to imprint onto their mother's face, and then they want to follow that face around. Is there evidence that they can kind of represent multiple objects with this same kind of invariance that they can represent a face?
JUSTIN WOOD: OK. So that's a really big question, and I'd love to talk to you about it. I realize that it's already 5 o'clock, so unfortunately I'm going to need to-- if it's possible, I hate to do this, but I'm going to shut down questions for a little while, and I'm going to continue on with the talk. And then when I get to the end, I'd love to talk to you guys about all of this stuff in detail. But for now, let me just continue on with the talk. I'd love to chat about that question later.
OK. So far I've presented evidence that newborn chickens are capable of viewpoint invariance. They're capable of background invariance. And they're capable of recognizing objects rapidly across changes in viewpoint. So together, I think these findings provide evidence for two main conclusions.
First, chicks are capable of core object recognition. So like adult monkeys and like adult humans, who can recognize objects quickly and accurately across changes in viewpoints, it looks like newborn chickens are also capable of the same thing. So extensive experience with a rich visual world filled with hundreds to thousands of objects does not appear to be necessary for developing at least the building blocks of core object recognition. Core object recognition appears to be highly conserved across both developmental time scales and also across evolutionary time scales. So in other words, different taxa appear to have at least similar object recognition abilities.
And second, this data shows that newborn visual processing machinery is surprisingly powerful. High-level vision can emerge from very sparse input. I've shown you that it can emerge from just 60 degrees of visual object input-- and we've done other studies showing that even three unique frames of an object are sufficient to build an invariant object representation, although that representation is a bit less robust.
OK. So in the final part of my talk, I want to examine the origins of this ability. So typically, when an ability is found in a young chick or in a young infant, the common interpretation of this finding is that it's innate or hardwired by genes. But another possibility is that even early-emerging abilities, appearing in the first few days of life, might be learned from the animal's or the infant's experiences in those first few days. And when that first [INAUDIBLE] study came out, I actually received several emails from vision scientists saying that they thought it was strong evidence for hardwired features in the brain. After all, these newborn chickens were able to build invariant representations from just 60 degrees of visual object input.
And that was certainly my hypothesis when we first discovered this phenomenon. I thought it was hardwired in. However, we wanted to test this possibility directly, examining whether newborns are learning these object representations from natural visual experience. And I think this is where the real power of controlled-rearing experiments comes into play. With controlled-rearing experiments, we can specifically test whether the development of an ability requires a certain type of visual experience.
So in terms of the object recognition literature, one possibility is that invariant object representations are learned from experience with natural visual worlds. This idea has been around for a very long time in the computational neuroscience literature, where many researchers have noted that experience with a natural visual world might provide sufficient information for the brain to learn how to recognize objects across various kinds of visual transformations. So in other words, natural visual input might spell out to the visual system how objects change over time. So even though natural visual environments vary widely, as we can see in these many different images of scenes, they also have some common characteristics.
And I want to focus on two of these characteristics. So first, natural visual environments change slowly over time. Objects tend to remain present for seconds or longer, whereas the photoreceptor cells in our eyeballs are changing on the order of milliseconds. So the visual system might gradually build up invariant object representations by extracting slowly varying signals from retinal input. Now, the second characteristic of natural visual environments is they tend to change smoothly over time. So when an object moves across the visual field, it doesn't teleport from one location to the other. It moves smoothly across our eyeball.
So to illustrate this idea, consider this video from a pool player wearing a helmet cam. And what I want you to focus on is the shape of the pool table as the player moves around the table. We can see the shape of the table changing significantly as the player gets closer to and farther from the table, as they move around the table, and as they enter different illumination conditions. So critically, this natural visual input is basically spelling out to the visual system how the shape of the pool table ends up changing over time.
And we can visualize this idea that natural visual environments vary widely but not without limits with this two-dimensional graph containing smoothness and speed as axes. So the smoothness axis refers to whether visual environments change smoothly or abruptly over time. And the speed axis refers to whether visual environments change slowly or quickly over time. And if we were to chart the position of natural visual environments, we would find that they would all be clustered in this bottom left area of the graph. Natural visual environments tend to be slow, and they tend to be smooth.
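As a rough illustration of how those two axes could be quantified for any rearing video, here is a sketch that scores a frame sequence for abruptness (a proxy for smoothness) and rotation speed. The particular measures are assumptions for illustration, not the definitions used in the experiments.

```python
import numpy as np

def smoothness_and_speed(frames, frame_rate_hz, degrees_per_frame):
    """Crude proxies for the two axes of the graph described above.

    frames: array of shape (n_frames, height, width), grayscale images.
    Abruptness is the mean frame-to-frame pixel change (small = smooth);
    speed is the object's rotation rate in degrees per second. Both
    definitions are illustrative assumptions.
    """
    frame_diffs = np.abs(np.diff(frames.astype(float), axis=0))
    abruptness = frame_diffs.mean()                   # higher = less smooth
    speed_deg_per_sec = degrees_per_frame * frame_rate_hz
    return abruptness, speed_deg_per_sec
```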
So the question is, does the development of invariant object recognition require natural visual input, this specific type of visual input, smooth and slow input? And so to test this idea, we can systematically manipulate the chicks' visual worlds. We can raise them in worlds that are systematically more abrupt. We can raise them in worlds that are systematically faster. And we can examine whether these unnatural visual worlds end up breaking their object recognition abilities. So by pushing vision beyond its natural bounds, by raising chicks in these impossible or unnatural visual worlds, we can directly examine the types of visual input that are needed to build an invariant or abstract object representation.
So let's first explore whether chicks require smooth visual input to build invariant object representations. Some chicks were raised in a smooth visual world in which the object rotated at a rate of 24 frames per second. Some chicks were raised in a slightly less smooth visual world in which the object rotated at a rate of five frames per second. You can see that it's generally smooth, but now we can see the object features jumping small distances across the retina. Some chicks were raised in worlds in which the object rotated at a rate of one frame per second, so now we can see the features jumping even greater distances across the retina. And some chicks were raised in a scrambled visual world in which we just randomized the order of the viewpoints, making sure they differed significantly from one another.
And critically, the scrambled condition contained all of the same unique frames of the object as the smooth condition. So again, the chicks in both conditions saw all of the same unique frames of the image. The difference is whether the frames were placed in a natural temporal order or in a scrambled, unnatural order.
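For concreteness, here is one way the smooth and scrambled frame orders could be generated from the same set of unique viewpoints. The 70-degree minimum separation is an assumption based on the constraint mentioned in the talk, and the code is a sketch rather than the lab's animation software.

```python
import numpy as np

def viewpoint_order(n_frames, condition, min_gap_deg=70.0, rng=None, max_tries=100_000):
    """Frame order for one rotation cycle in the rearing conditions above.

    'smooth' keeps the natural temporal order; 'scrambled' permutes the same
    unique frames so that successive frames differ by at least min_gap_deg
    of rotation (an illustrative threshold).
    """
    order = np.arange(n_frames)
    if condition == "smooth":
        return order
    rng = rng or np.random.default_rng()
    deg_per_frame = 360.0 / n_frames
    for _ in range(max_tries):
        perm = rng.permutation(order)
        gaps = np.abs(np.diff(perm)) * deg_per_frame
        gaps = np.minimum(gaps, 360.0 - gaps)       # circular angular distance
        if np.all(gaps >= min_gap_deg):
            return perm
    raise RuntimeError("no valid scrambled order found; relax min_gap_deg")
```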
So here's an example of a chicken being raised in an abrupt visual world. And again, chicks received all of the same unique object images as the chicks raised in the smooth visual world. And here's an example of one of the test trials. Both the imprinted object and the unfamiliar object were presented in this abrupt object motion. And I want to note that for a human adult this is just a trivial task. We can easily distinguish between these objects. But what if this was your only experience with objects? Would you still be able to recognize the object?
So here's what the data looks like. Chance performance was 50%. And as you can see, performance was quite good in the smooth visual world. But as the visual world became less and less smooth, performance started to approach chance levels. So in other words, when chicks are raised in visual worlds that don't contain smooth object input, but instead contain abrupt object input, we seem to be able to break invariant object recognition in newborn chicks.
So this data indicates that newborn chicks build invariant object representations from experience with smoothly moving objects. And this idea provides at least suggestive evidence that newborn chicks use unsupervised temporal learning machinery to build their very first visual object representation. So even though the smooth visual world and the abrupt visual world contained all of the same unique object images, only the chicks in this smooth visual world were able to build a good invariant abstract object representation. So when we're looking for computational models to explain newborn vision, we need a model that operates best over smoothly changing images of objects.
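As one minimal illustration of what "unsupervised temporal learning machinery" could mean here, the sketch below nudges a linear feature bank toward temporal slowness, so that features respond similarly to temporally adjacent frames. It is a toy stand-in with assumed names, not the model used in the talk.

```python
import numpy as np

def temporal_slowness_update(W, frame_t, frame_t_plus_1, lr=1e-3):
    """One gradient step on the slowness penalty ||W x_t - W x_{t+1}||^2.

    W: feature bank of shape (n_features, n_pixels).
    frame_t, frame_t_plus_1: temporally adjacent images (any shape).
    Rows of W are renormalized to avoid the trivial all-zero solution.
    """
    x_t = frame_t.astype(float).ravel()
    x_t1 = frame_t_plus_1.astype(float).ravel()
    diff = W @ x_t - W @ x_t1                  # change in feature responses
    grad = np.outer(diff, x_t - x_t1)          # gradient of the slowness penalty
    W = W - lr * grad
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-8)
    return W
```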
So now let's turn to the second property of natural visual environments. They tend to change slowly over time. So to examine whether experience with slowly moving objects is necessary for building invariant object representations, we raised chicks in worlds that changed at various speeds. So some chicks were raised in a world that changed slowly over time. The object rotated 360 degrees every 15 seconds. Some chicks were raised in a really fast visual world, where the object rotated once every second. And some chicks were raised in a medium changing visual world, where the object rotated once every five seconds.
So here's an example of a chick being raised in a slow visual world. Again, just a single virtual object in the world for the first week of life. This object is rotating once every 15 seconds. Here's an example of a chick being raised in a medium changing visual world. Again, this object rotated once every five seconds. And here's a chick being raised in a quickly changing visual world. This object rotated once every second.
And we measured two features of the object representation created by each chick. So first, we measured whether chicks encoded information about the viewpoint of the object, or the spatiotemporal features of the object. So one monitor showed the imprinted object rotating around the familiar axis of rotation, which presented familiar viewpoints to the subject. And the other monitor showed the imprinted object rotating around an unfamiliar axis of rotation, thereby presenting novel viewpoints to the subject. So if subjects encoded information about the viewpoints, they should spend more time with the familiar viewpoints than the novel viewpoints.
Second, we measured whether chicks built identity representations of this object. So one monitor showed the familiar object, the imprinted object, rotating around a novel axis of rotation, which presented novel viewpoints to the subject. And the other monitor showed the unfamiliar object rotating around the familiar axis of rotation. And this was important, because it maximized the pixel-level and V1 similarity between the unfamiliar object and the imprinting stimulus. So if chicks can successfully build an identity representation that generalizes across novel viewpoints, they should spend more time with this animation than this animation.
So here's just a quick example of a viewpoint trial. You can see here the imprinted object rotating around the familiar axis of rotation, and here we can see the imprinted object rotating around a novel axis of rotation. Again, chicks should spend more time with this animation than this animation if they encode the viewpoint of the object. And here's an example of an identity trial. Here you can see the familiar object rotating around a novel axis of rotation, and here we can see the unfamiliar object rotating around the familiar axis of rotation. So if chicks can recognize their object-- if they can build an identity representation that can generalize to novel viewpoints-- they should spend more time with this object than this object.
So here's what the data looks like. Chance performance, again, was 50%; there were always two options. The blue bars show the chicks' sensitivity to viewpoint information, or spatiotemporal information, and the red bars show the chicks' sensitivity to identity information-- that is, how well the representation generalized across novel viewpoints. Now, as you'll notice, in the slowly changing visual world, chicks built a really good object representation. It generalized really well to novel viewpoints, and it contained little to no viewpoint-specific information. However, as the visual world became faster and faster, chicks ended up building, in a sense, a really bad representation. The representation was strictly tied to the specific spatiotemporal features-- the familiar viewpoints of the object-- and it contained basically no identity information at all.
I also want to make two additional points about this data. First, we've done other experiments showing that the nature of the representation created by chicks is determined by the object's speed during encoding, not during recognition. So if chicks are imprinted to a slowly moving object in the first week of life and then tested on fast-moving objects, they can easily recognize the object, even when it's moving fast. And second, all chicks imprinted equally strongly. So all of the chicks built a very robust visual representation of these objects; it's just that only the chicks in the slow visual world were able to build an abstract or invariant representation of the object.
So this data suggests that newborn chicks build invariant object representations from experience with slowly moving objects. And I think this is evidence for unsupervised temporal learning machinery with a constant learning window. It's at least a qualitative prediction of this kind of machinery, and spike-timing-dependent plasticity is a notable example of a biologically plausible neural mechanism with a constant learning window.
So as an object rotates more quickly, a greater number of photoreceptor cells are going to activate within any given learning window. So if the machinery has a constant learning window, faster rotation speeds should cause the input to the system, and also the resulting object representation, to be smeared in the direction of object motion, and to become more selective for this spatiotemporal viewpoint information.
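To illustrate the smearing prediction, here is a sketch in which the frames falling inside one fixed learning window are simply averaged; with a constant window, a faster-rotating object contributes more distinct views per window, so the averaged template is more smeared in the direction of motion. The 100 ms window length is an assumed value for illustration.

```python
import numpy as np

def smeared_template(frames, frame_rate_hz, window_ms=100):
    """Average the frames that fall within one fixed learning window.

    frames: array of shape (n_frames, height, width) sampled at frame_rate_hz
    from the imprinting animation. The window length (default 100 ms) is an
    illustrative assumption, not a measured quantity.
    """
    frames_per_window = max(1, int(round(frame_rate_hz * window_ms / 1000.0)))
    return frames[:frames_per_window].astype(float).mean(axis=0)
```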
And we're now running studies that test this idea systematically by raising chicks in fast and slow visual worlds, and then presenting them with static images of objects that are gradually smeared in the direction of object motion. We would predict that chicks raised in fast visual worlds should prefer an object like this, compared to an object like this. And vice versa for the subjects raised in the slow visual world. So we should have the data-- well, I won't have the data for you, but we should have the data in about a week or two.
So it looks like newborn visual systems build invariant object representations from natural visual input. So this visual input must change slowly over time and it must change smoothly over time. So abrupt visual motion seems to break invariant object representations, and fast speeds appear to, in a sense, smear object representations in the direction of object motion. And I want to quickly emphasize that these findings reflect the very first object representation built by newborn visual systems. So I think we're in the exciting position where we can now systematically manipulate the abstract form of the first object representation built by a newborn visual system. Newborn visual systems can be powerful, as we saw in part two of this talk, but they seem to be also subject to some major constraints.
So just to summarize part three of my talk: it looks like invariant object recognition is not a hardwired property of vision, but is learned when a newborn sees their first object. And newborns seem to build invariant features from a specific type of input-- smooth and slow visual object motion. I do want to mention a couple of other points. First, adult visual systems can build invariant representations from both abrupt and fast visual input. So I think these are constraints on newborn visual systems, not on adult visual systems more generally. Indeed, many studies with adult monkeys, adult rats, and adult pigeons have given animals static images as input, and those subjects do just fine in terms of building invariant object representations. So again, these seem to be limits on newborn visual systems, not adult visual systems.
The second point is that these are constraints on high-level vision, not on vision more generally. So for example, in the speed experiments, the chicks in the fast visual world still built a robust representation of the object. They still wanted to spend all their time with mom, so to speak. They just weren't able to build an invariant representation that could generalize across viewpoints.
So although newborn visual systems are powerful, they're also subject to some major constraints. The machinery must receive slow input, and it must receive smooth input in order to build these good invariant representations. And more generally, I think these results are consistent with hierarchical sensory processing systems that become shaped by unsupervised temporal learning. And to test this idea directly, we're now building deep convolutional neural networks that become shaped by unsupervised temporal learning, and we're giving these networks the same visual input that we gave to the newborn chickens in the controlled-rearing experiments. And we're examining, do they build the same types of representations? And we've really just started this work, but our preliminary results show that just like newborn chickens, these networks can build invariant representations from very sparse input. And second, just like newborn chickens, these networks end up breaking when they're provided with abrupt visual input. So we think we're roughly on the right track.
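For a concrete picture of what a network "shaped by unsupervised temporal learning" could look like, here is a toy sketch of a small convolutional encoder trained with a temporal-coherence objective: embeddings of temporally adjacent frames are pulled together, and embeddings of temporally distant frames are pushed apart. The architecture, loss, and names are illustrative assumptions, not the networks described in the talk.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Toy convolutional encoder; a stand-in for the deep networks mentioned
    in the talk, not the architecture actually used in the experiments."""
    def __init__(self, embedding_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, embedding_dim)

    def forward(self, x):
        h = self.features(x).flatten(1)                 # (batch, 32)
        return nn.functional.normalize(self.proj(h), dim=1)

def temporal_coherence_loss(z_t, z_next, z_distant, margin=1.0):
    """Pull embeddings of adjacent frames together; push a temporally
    distant frame's embedding at least `margin` away. Illustrative only."""
    pos = (z_t - z_next).pow(2).sum(dim=1)
    neg = (z_t - z_distant).pow(2).sum(dim=1)
    return (pos + torch.clamp(margin - neg, min=0)).mean()

# Usage sketch: frames_t, frames_next, frames_distant are batches of
# consecutive and randomly chosen frames from the imprinting video,
# assumed to be tensors of shape (batch, 3, height, width).
# model = SmallConvNet()
# loss = temporal_coherence_loss(model(frames_t), model(frames_next), model(frames_distant))
```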
We're also testing other predictions of these unsupervised temporal learning mechanisms. For example, one key prediction of this machinery, which has been tested extensively in Professor DiCarlo's lab here at MIT, is that it should be possible to build impossible or unnatural object representations in a system by exposing subjects to unnatural visual worlds. So in one set of studies, we imprinted chickens to this object. As you can see, it changes its identity over time; it doesn't maintain a geometrically consistent shape. And we then tested whether they could distinguish that object from novel objects. And we found that chicks were able to build a very robust representation of this object. They could notice changes in color. They could notice changes in shape. And they could even notice changes in color-shape binding. So in other words, a newborn chicken, with their very first object representation, can already bind color and shape features into an integrated representation in memory. So this suggests that newborn chickens, in a sense, can solve the visual binding problem.
So the work I've told you about today suggests that newborn animals learn to perceive and understand the world by using unsupervised temporal learning machinery. However, there's a very large class of biologically plausible models that use unsupervised temporal learning. So one future goal of our lab is to continue to constrain the possible space of computational models by continuing to build detailed input-output maps that can serve as, in a sense, targets for computational models. Again, one of our primary goals is to examine whether computational models and newborn visual systems build the same types of representations as one another when given the same amount of input.
And I want to very quickly give you three examples of how we're doing this. So in one set of studies, we're systematically manipulating the amount of input given to newborn chickens. Some chicks are raised in a world in which the object rotates through just 11.25 degrees. And other chicks are raised in a world in which the object rotates all the way around, so these chicks get 32 times more unique images of the object. And then, to measure the abstract form of the representation, we measure recognition performance with static images in a recognition space that canvasses views of the objects in 45 degree increments. So the performance of each chick can be characterized as a single point in a 24-dimensional space, with each dimension corresponding to performance on one of these viewpoints. We can then examine whether different instantiations of computational models, with different parameter settings, are able to produce representations that fall in that same area of the 24-dimensional space.
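A sketch of how such a 24-dimensional comparison could be computed from raw time-spent data is shown below; the variable names and the Euclidean comparison metric are illustrative assumptions, not necessarily the lab's analysis.

```python
import numpy as np

def performance_vector(time_with_target_s, time_with_distractor_s):
    """Per-viewpoint recognition performance as percent time with the target.

    Both inputs are length-24 arrays, one entry per test viewpoint in an
    assumed 24-viewpoint recognition space (as described above).
    """
    target = np.asarray(time_with_target_s, dtype=float)
    distractor = np.asarray(time_with_distractor_s, dtype=float)
    return 100.0 * target / (target + distractor)      # one 24-dimensional point

def chick_model_distance(chick_vec, model_vec):
    """Euclidean distance between a chick's point and a model's point in the
    24-dimensional performance space; the metric is an illustrative choice."""
    return float(np.linalg.norm(np.asarray(chick_vec) - np.asarray(model_vec)))
```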
Second, we're raising chickens in worlds that contain various numbers of objects. Some chicks are raised in worlds with one object, some with two objects, some with three objects, and some with four objects. And again, we're examining: when a computational model and a newborn chicken are raised in worlds with various numbers of objects, do we see these objects competing with one another for limited processing resources in the same way in the newborn chicken brain and in the computational model?
And finally, we're trying to push the limits of this imprinting cortical circuit by examining how it behaves in other domains, such as scene recognition, action recognition, number representation, and face recognition. And there's growing evidence-- I realize this is a hotly debated point-- but there's growing evidence for a canonical circuit in the brain. So one exciting possibility is that it may be possible to build many different types of visual representations within the same cortical circuit. If this is true, then the type of representation formed in a cortical circuit should be a direct reflection of the incoming visual input. And we can test this idea explicitly with a newborn chicken: we can examine what types of representations it's possible to create within a fresh cortical circuit that hasn't yet been shaped by any visual experience.
So to conclude, I think there are four main ideas from my talk. First, newborn visual systems can develop high-level visual abilities rapidly from sparse input. Again, even just a single virtual object seen from a 60-degree viewpoint range is enough to build an invariant and abstract representation. Second, these high-level visual abilities are not hardwired properties of vision, but are learned rapidly from natural visual input-- specifically, slow object motion and smooth object motion. Third, high-level vision appears to emerge from unsupervised temporal learning mechanisms. And fourth, and more generally, automated controlled-rearing studies of newborn chicks can be used to constrain computational models of newborn vision.
So to conclude, I want to thank the members of my lab for their very hard work in helping to develop this controlled-rearing method, and also the NSF for funding this work. And I also note that if you want to see more details of this work, you can visit our website at www.buildingamind.com. Thank you very much for your time.
[APPLAUSE]
AUDIENCE: This is really thrilling stuff. Your claim that this shows-- well, what exactly do you mean, and what is the evidence you're referring to when you say that the high-level parts of the visual system are not hardwired? What was this on your previous slide?
AUDIENCE: Go back to your conclusion, so we can interrogate you thoroughly.
AUDIENCE: Your second point, not hardwired properties of vision. So say what you mean by that, and what data you think argue for that?
JUSTIN WOOD: So to potentially be more precise, it looks like invariant object recognition features aren't hardwired properties of vision. So that's the extent--
AUDIENCE: You're referring to the slow and smooth findings.
JUSTIN WOOD: The slow and smooth findings, exactly. So when chicks are raised in a world that contains abrupt object motion, even though it contains very clear object features, or raised in a world in which the object spins or moves very quickly, they don't seem to build the invariant object features that allow the object representation to generalize to novel viewpoints. The only kinds of worlds in which those features seem to emerge are natural ones-- ones in which the object moves smoothly and slowly over time.
AUDIENCE: [INAUDIBLE] features of their visual representation repertoire or something are not themselves hardwired?
JUSTIN WOOD: Exactly, exactly. So I think there are many innate features of the brain, including the connectivity, how various kinds of areas of the brain are hooked up to one another. Yeah, I'm specifically referring to just the features that are used to recognize these objects across different viewpoints.
AUDIENCE: There's a lot of potential for different interpretations there. You put out certain kinds of learning suggestions, but there could be many other possibilities. I mean, just one is this: slow and smooth are properties of image motion that are good for image motion estimation, right? So it could be that by messing with those things, you've messed with the chick's ability to estimate motion, which may be necessary for estimating depth. And maybe high-level vision-- one might mean different things by that. One might mean something like going from a depth map to a 3D object representation. And maybe what you've done is interfere with their ability to get to the depth map, without actually interfering with, or really getting at, their notion of what an object is, for example. I'm not saying that's true, I'm just saying it seems like-- I mean, I agree with Nancy, this is a really supercool research program, and that part in particular raises a lot of questions, but it seems like there are a lot of possibilities.
JUSTIN WOOD: Yeah, I think those are really, really good points. So I guess two things to emphasize. One is that we're very much just getting going, so a lot of these findings are actually just a couple of months old. So I think there are a lot of other studies we would want to do along those lines to flesh out exactly what is being broken by the abrupt motion versus the smooth motion. I agree those are very important things to discover. And I think the other thing is that this is kind of where computational models end up helping: if we do have, for example, a computational model that becomes shaped by unsupervised temporal learning, and we see that it tends to behave in the same way as a newborn visual system when given the same amount of input, I think that kind of gets us around at least some of these problems-- if we end up finding such a computational model, which we don't have yet.
AUDIENCE: I feel like that's half of it. The other half is testing other computational models to see whether they do the same thing, or don't-- to see whether you've found a distinctive phenomenon yet that tells apart that one computational model from a whole bunch of others.
JUSTIN WOOD: Oh, absolutely. Absolutely. And one thing we're hoping to do-- and this probably won't be online for the next year or two-- is we really want to take all of the input-output mappings and create a public database, because we think that this data could be useful for a lot of people who do computational modeling. So when we do have this large input-output mapping, we're going to, again, create a public database. Any computational modeler in the world can log in, access this data, and see whether they can create a better computational model that can actually explain these data. And I think it's really when you get everybody involved in this sort of natural selection of computational models, seeing who can account for the data the best, that we're really going to make major progress.
AUDIENCE: [INAUDIBLE] membership, where you can also then automatically upload experiments that just automatically--
[LAUGHTER]
[INAUDIBLE].
JUSTIN WOOD: I mean, no. That would be great. I'd love to have a little bar showing performance, so people can actually compete with one another.
AUDIENCE: --and here's the thing. [INAUDIBLE]
AUDIENCE: Assuming [INAUDIBLE]
AUDIENCE: Yeah, mechanical chicks.
JUSTIN WOOD: Yes?
AUDIENCE: I mean, with the viewpoint-specific experiments, there appears to be another kind of invariance there, though-- the rotation speed. So as you mentioned, the rotation speed at the recognition phase doesn't matter. Once the chick is imprinted at a certain speed, there's still a generalization, not across viewpoints, perhaps, but across angular speed, or something. So perhaps there are different dimensions of generalization going on. So if the input is a speedily rotating thing that is sort of outside of whatever natural expectations the newborn chick has, then the axis of generalization is not 3D geometry. I guess this aligns with what Josh was talking about, but more around the exact form of the thing that could still generalize-- not in the [INAUDIBLE] domain but in the temporal speed domain.
JUSTIN WOOD: So absolutely. I actually would not want to say that they're building a three-dimensional representation of this object. So just to give you an example of data I didn't show you, if you present-- so do you remember the front view and the side view of the object? This is data that's actually just a couple days old. If you present them with the front view and then the side view, so it's completely different visual input, but of the same object, they actually end up building invariant representations of both, but they're different types of invariant representations. So they generalize differently across that 24-dimensional space.
AUDIENCE: But this is something--
JUSTIN WOOD: But if they were building a three-dimensional representation, it should generalize across all of those viewpoints. So I agree with you. I think you're right.
AUDIENCE: I mean, the data points that you just mentioned could just as well be predicted by a model that recovers 3D structure. Because, I mean, after all, 3D recognition is based upon the evidence. So you may actually end up with different 3D structures depending on what viewpoints of the image you've got, if you consider some prior biases and whatnot. So I wouldn't say any of the data presented here is evidence against some reconstruction of a 3D representation.
JUSTIN WOOD: So I think, again, this is the exciting part where you just let computational modelers compete with one another and see which models end up doing best in terms of accounting for the data. And as long as we have a really large, detailed input-output map across many different experiments, hopefully we'll be able to figure out which model is best and which isn't. Yeah, but really good point. Yes?
AUDIENCE: Have you tried testing on a different type of transformation that the chicks were not imprinted on? Like if you show them the rotating objects, then change the illumination.
JUSTIN WOOD: Yeah, so we're just getting to those points. So we've tried size changes, and that doesn't seem to affect the chick much. But the thing with that is that the chick moves back and forth across the chamber, so we do size changes that are greater than the ones that they do see in the chamber, and they seem to do just fine, but they're still imprinted to it, in a sense, as they move back and forth across their chamber. We've just started exploring illumination, and they seem to do just fine across illumination changes. But that data is still really new, and we still need to do all of the counterbalancing and stuff. So don't write home about that, so to speak. Yeah?
AUDIENCE: Can you comment on uncontrolled aspects of your environment? So for example, the chick can see some of-- its feet, right? It can learn a lot just from looking at what happens when it moves its feet, right? So how much knowledge are you saying is [INAUDIBLE] from aspects that you cannot control?
JUSTIN WOOD: Yeah, I think that's a great point. And at first, we were really worried about that. So we thought that-- there are even extended surfaces, and extended surfaces do provide a lot of information about how the world is organized. So I think the data gives the best answer to this: the fact that we can manipulate a representation so much-- we can push it around from being strictly tied to the spatiotemporal input that was used to build the representation, all the way to the other end of the spectrum, to a completely abstract representation that doesn't at all seem to be tied to that input-- suggests that even though they can see their feet, and they can potentially see their wings when they're grooming, those particular inputs really aren't influencing that much the kinds of representations they end up creating. But I think in this work it's always important to keep that in mind. They can see their feet. They can see the grid flooring-- we have to put them on grids so-- well, I won't go into the reasons why; we don't want them wallowing around in their poop, basically. So there are some things we had to do within these chambers that do give them experience-- experience with grain, and so on. So I think that's always worth keeping in mind. That's a good idea.
ELIZABETH SPELKE: On that note, let's end and thank Justin--
[APPLAUSE]