Quest Engineering Team presentation
Date Posted: December 2, 2022
Date Recorded: November 4, 2022
Speaker(s): Katherine Fairchild, Engineering Team Lead
Advances in the quest to understand intelligence
Jim DiCarlo: All the work you've heard so far today is work led often by faculty teams, but carried out by postdocs and students, many of whom were acknowledged along the way. Now you're going to hear from Katherine Fairchild. She leads the Quest Engineering Team and, as she'll explain to you, this is one of those enablers that I mentioned at the beginning to create tools to support a number of the Missions, and she'll tell you about an update on how that's going. So Katherine, over--
KATHERINE FAIRCHILD: Hello, everyone. As you heard, I am Katherine Fairchild. I am the Quest Engineering Team Lead. I have the somewhat dubious distinction of being the last person you will hear from before lunch. So if we can all set aside our collective gustatory interests, I would be happy to talk to you about building the science of intelligence.
So, let's start with some basics, like what is the Quest Engineering Team and why do we need one? So you've already heard about some of the Missions in the core human intelligence area. You will be hearing about some more that lie outside of that, and of course, there are future Missions as yet to be incubated. We really think of the Quest Engineering Team as sitting directly in the center of these Missions.
We work with these Missions to identify, scope, design, architect, construct, and deliver the platforms and systems that are going to facilitate their research. And in the same way that these different facets of intelligence often overlap and synchronize with each other, it is often the case that a Mission will identify something that would be really valuable to them that actually has more general application for the other Missions as well.
It is also sometimes the case that we build something and later on realize, hey, this can connect really nicely with something new that we're building for another Mission. So in this sense, we think of ourselves as essentially integrating across these Missions.
So a large part of what you're going to be hearing about today, not just from me but from everybody else, is how the Quest is integrative. The Engineering Team, sitting right in the middle of that, is a core component of that integration. In addition to integrating, we also think of ourselves as being in the business of benchmarking.
A lot of the discussion you've heard already is about building new models, developing new models, testing them. That's somewhat the glamorous side of intelligence research. Somebody has to be in charge of the sometimes slightly less glamorous side of taking those models and really testing them robustly against other existing models and standards in the field and saying, what does this really contribute? In what ways have we produced something new and exciting?
So, Quest Engineering Team, what are we doing? Integration and benchmarking, but why do we need a Quest Engineering Team? Why can't, for instance, the Missions do this themselves?
Well, the academic infrastructure is usually not really aligned with having a long-term engineering endeavor. You've got grad students and postdocs cycling in and out every couple of years. You've got a publishing schedule, you've got exciting cutting-edge research happening that is not necessarily aligned with the idea of unit tests and continuous integration and large-scale support for engineering platforms.
So conversely, you might be asking, well surely, industry, they've got an interest in building all of this stuff. Why do you need a Quest Engineering Team, why can't you just let Amazon and Google and whatnot take care of this? And again, it comes down to resources and alignment.
So in industry, you are often looking to push your algorithms just a little bit further based on performance and accuracy. You do not necessarily have any interest in understanding whether or not those algorithms are biologically plausible. So while we've already talked about how at Quest we think that neuroscientifically motivated research and development would actually assist in developing those algorithms, mostly across industry they're just not all that interested in the biological applications.
So, at Quest, we're putting software engineering best practices towards achieving the goals of scientific intelligence research. And we do want to use those existing tools that come out of industry as much as possible, we don't want to reinvent the wheel, but we want to do it in a way that is driving the science of intelligence forward, and in a fashion that would be well beyond the capability of almost any one lab to do on their own.
So, who is the Quest Engineering Team? Well, there's me, obviously. And before I introduce the rest of them, I would love to give you just a short personal anecdote. Immediately prior to joining Quest, I was working in machine learning for financial applications at a very large bank.
The moment that I first saw the original announcement for the MIT Quest for Intelligence, I immediately thought, OK, I need to be there. I am a no-holds-barred true believer in the mission of the Quest of fueling new discoveries in AI with new discoveries in natural intelligence and vice versa.
So I submitted my application for an AI software engineer position, and next week will mark four years. Over those last four years we have done some really, really genuinely cool stuff, but I am currently the most excited about where we are right now. And I hope that by the end of our time together, you, too, are going to agree with me and be just as excited about what we've got coming in the future.
So joining me on this journey are my excellent collaborators on the Quest Engineering Team. This is the core of the Quest Engineering Team, and we all have varying expertise along a spectrum from cognitive science to software engineering and systems.
It is fairly rare to have that range of expertise on any one team, so that's part of the secret sauce of the Quest Engineering Team: it enables us to interface with the researchers who are really pushing the science forward while also doing all of the programming work to actually build those systems and, again, write unit tests.
We also have-- and this is really the other secret sauce of the Quest Engineering Team, some pretty fantastic collaborators. So these two different blocks represent people in different projects that we've taken on for different Missions.
And one of the ways that we execute on these projects is by essentially embedding these researchers into the Quest Engineering Team for the duration of the project, which means that they are helping us not just at this abstract high level, but even writing code, contributing code, developing features, and giving us really important insight and feedback into how they, as the experts in the field, understand and use these products.
So we have this baked in. We don't all of a sudden go, OK, it's time to onboard you and beta test and get your feedback. That's part of the entire build. And after we have the product that they've asked for, we maintain those relationships over time.
And it's not just internal collaborators at MIT, it's also external collaborators. And of course, as we add more projects across more Missions, the Quest Engineering Team will simply scale upwards and expand that way. So as resources allow, we are set up to scale towards a more complete science of intelligence.
OK, so we've talked about what, why, and who, but what is it that we actually do? So I'd like to talk to you about where we are, where we've been, and finally, where we're going. Starting with where we are right now: as you heard from the Language Mission representatives, we have been working on a project related to language that we are currently wrapping up. We've been referring to it as Brain-Score Language.
The problem that we face in this field is, OK, we want to compare AI and biological capabilities, but how do we actually do that? Like how do we better understand the differences between humans and AI or biological agents in general-- we heard about monkeys and rats as well-- when it comes to language capabilities?
So over here, you have your experimentalists who are working directly with humans in the lab. Over here, you have your programmers who are working with their models. And what does it actually mean when you get this data from your human experiments and your modeling experiments? How do you make that comparison?
So what does this actually look like? You've got your humans, you've got your models, and you're going to give them some language task. So maybe read a bunch of sentences and predict what word comes next. And as you're doing this, you're collecting their neural activations.
Obviously with the network, these are artificial neural activations. For the human, this is maybe through fMRI or MEG. And you're also, of course, checking the behavioral output. So how long does it take? Is there a pause after you've read this one versus a longer delay with the other ones?
At the moment, it is really up to every lab to decide how they want to compare their outputs when they're making a claim about how their model measures up against human behavior. The way we've decided to approach these comparisons is to make a system that makes it easy for both the human experimenters and the modelers to add their data and models and get those results out. That is the actual comparison.
So here, we have our task, and now you're going to get out a score. A score that tells you how much alike your outputs are. This is a score from 0 to 1, and it's just measuring, again, how alike your network's neural activations and behavioral output are to the human's neural activations and behavioral output on the same language task. And then we're going to do this for a bunch of different tasks.
And finally, we are going to average all of these tasks together for a score that tells you how well-aligned your model is with the human system overall. And the more benchmarks we add, the more powerful the system becomes at being able to act as a true indicator of linguistic intelligence.
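To give a concrete flavor of what a per-benchmark score and its aggregation could look like in code, here is a minimal sketch in Python. It assumes one common way of comparing model activations to human recordings (a linear mapping fit on half the stimuli, scored by correlation on the held-out half); the actual Brain-Score Language metrics, data formats, and function names may differ.

```python
# Minimal sketch of per-benchmark scoring and aggregation. The mapping and
# metric here are one common choice (assumed), not necessarily the ones
# Brain-Score Language actually uses.
import numpy as np
from sklearn.linear_model import Ridge

def benchmark_score(model_activations, human_recordings):
    """Return a rough 0-to-1 alignment score between model and human responses.

    model_activations: (n_stimuli, n_units) activations on the language task.
    human_recordings:  (n_stimuli, n_voxels) e.g. fMRI responses to the same stimuli.
    """
    n = model_activations.shape[0]
    train, test = np.arange(n)[: n // 2], np.arange(n)[n // 2:]
    # Fit a linear map from model units to human responses on half the stimuli...
    mapping = Ridge(alpha=1.0).fit(model_activations[train], human_recordings[train])
    predicted = mapping.predict(model_activations[test])
    # ...then ask how well the held-out human responses are predicted.
    correlations = [np.corrcoef(predicted[:, v], human_recordings[test, v])[0, 1]
                    for v in range(human_recordings.shape[1])]
    return float(np.clip(np.nanmean(correlations), 0.0, 1.0))

def overall_score(per_benchmark_scores):
    """Average the per-benchmark scores, as described in the talk."""
    return float(np.mean(list(per_benchmark_scores.values())))
```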
Now if you would like to, you can also submit your scores to be publicly displayed on a leaderboard. You can see that this is a mock-up because, as usual with a project, the UI is the last to be completed. On the y-axis, you have your models; across the x-axis, you have your benchmarks. Green indicates closer alignment between the model and the benchmark, red indicates further away. And at a glance, you're able to see model rankings across the system of how well they compare.
We've got this, which is Brain-Score Language. This is actually an extension of an existing Brain-Score, which is at brain-score.org if you want to check it out live. It was originally referred to simply as Brain-Score, but now we're calling it Brain-Score Vision, because these instances of Brain-Score represent a fundamental philosophy: we can use this platform to immediately tell us how close our models are to brain and behavior on a range of experiments using different evaluation techniques.
So we have engineered this for easy extensibility. You might imagine another type of Brain-Score, for instance, from some of the talks you heard earlier today: Brain-Score Embodied Intelligence or Brain-Score Navigation. So I'd ask you to imagine what things could be next.
So that's where we are now. How did we get here? I would like to talk to you a little bit about the projects that we've done in the past. This one is substantially different from some of the others you'll hear me talk about because, just like everybody else, at the beginning of 2020 we were deeply affected by the COVID-19 pandemic.
So we were faced with a difficult problem that leadership was trying to handle in terms of figuring out how campus could be used to ensure adherence to COVID protocols. So there's this trade-off that they were wondering about between wanting to allow access to as many people as possible, especially students, but also minimizing COVID exposure.
So our solution was to help construct a platform that allowed leadership to get, at a glance, an estimate of how many people were in any building at any given time, as well as hourly usage estimates projected weeks into the future. Why were we the right people to do this? It goes back to that original thing I said about how our team uniquely has both systems and modeling expertise on the same team.
This is a project that required fusing data across many different sources to get aggregate statistics, and that is a form of modeling that is prevalent in AI and learning systems, so we already had that expertise.
And additionally, we also have that experience of sitting in the middle of multiple different teams of collaborators. This was a very, very wide-ranging effort with many contributors across MIT, and we helped coordinate those efforts. So while this is very much outside of our usual wheelhouse, we were very proud to be able to be part of the effort.
Getting back to our more usual range of projects, we were faced with a problem that was brought to us by some collaborators in the Developing Intelligence Mission: stimulus creation is often the most time-consuming and labor-intensive part of running an experiment.
Stimuli are the media, whether that's images or video or audio or text, basically any impetus shown during an experiment to prompt a reaction from the participant. Often, researchers will spend a ton of energy creating this stuff, and then it basically just languishes on their hard drive, never to be seen again. Or if it is, in fact, shared via publication or online or whatever, it's difficult for other researchers to find and repurpose.
So our solution to this was the Stimulus Library, which acts as a central database for researchers to share stimuli and find shared stimuli. A beta version was recently published at stimulate.mit.edu if you'd like to check that out. We currently have a student working on populating it with existing stimuli and encouraging researchers to use it going forward.
We are also creating stimuli to go into it. These are just a couple of examples from a suite of basic intelligence tests for both biological and AI agents. And between this and things like what you saw earlier with the "Find the Grape" task that Jim excelled at, you can start to see a future for this library, not just for stimuli, but possibly for cataloging entire virtual worlds with embedded rewards.
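To make the idea of a shared, searchable stimulus catalog concrete, here is a minimal sketch of what one record in such a library might look like. The field names and example values are hypothetical illustrations, not the actual Stimulus Library schema at stimulate.mit.edu.

```python
# Hypothetical sketch of a stimulus catalog record; the real Stimulus Library
# schema may differ.
from dataclasses import dataclass, field

@dataclass
class StimulusRecord:
    stimulus_id: str            # stable identifier for citing and reusing the stimulus
    modality: str               # e.g. "image", "video", "audio", "text"
    uri: str                    # where the media file itself is stored
    description: str            # what the stimulus depicts or asks
    source_experiment: str      # the study it was originally created for
    license: str = "CC-BY-4.0"  # reuse terms, so other labs can repurpose it
    tags: list = field(default_factory=list)  # search keywords

# A researcher sharing a stimulus might register it like this (all values invented):
example = StimulusRecord(
    stimulus_id="find-the-grape-001",
    modality="video",
    uri="https://example.org/stimuli/find-the-grape-001.mp4",
    description="Virtual scene with a hidden reward for a foraging task",
    source_experiment="Developing Intelligence pilot",
    tags=["virtual-world", "reward", "foraging"],
)
```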
Finally, we have another problem, again, in the same realm of comparing biological and artificial agents. What if you wanted to compare both a human and a robot navigating through an obstacle course? Well, some of you might recognize this image as being from Stata across the street.
So imagine you could set up an obstacle course on one of those floors. If you've been in Stata, you might make the argument that it's already an obstacle course. But then you just let your robot go and it navigates the obstacle course and you let your humans do it and you can compare the outputs.
However, is it really feasible to do that all the time? Do you just take up a floor of Stata with your obstacle course and not expect things to get moved? And what if you want to train your robot and you need 10,000 training runs to get really good results?
Well, one option is what you're seeing here, which is a simulation of Stata. This is generated from laser data that was collected by a robot going around the floor with a laser scanner. And we made it easy so that with basically one command, you can generate an entire 3D world from that laser data, identical to the real-world environment from which the data was collected. And you can localize an agent within it.
So you can do this for nearly any space. And you can start seeing how it becomes much easier to get the data that you need to feed into a benchmarking system like Brain-Score and really be able to compare these outputs.
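As a rough sketch of what going from laser data to a 3D world in one step might involve, here is a minimal illustration that extrudes a 2D occupancy grid (the typical output of laser mapping) into simple 3D wall geometry and maps an agent's position back onto the grid. The function names, parameters, and file handling are hypothetical, not the team's actual tooling.

```python
# Hypothetical sketch: turn a 2D occupancy grid from laser mapping into simple
# 3D wall boxes that a simulator could load. Not the Quest team's actual tool.
import numpy as np

def occupancy_grid_to_boxes(grid, resolution=0.05, wall_height=2.5):
    """Convert an occupancy grid into axis-aligned 3D wall boxes.

    grid:        2D array where values > 0.5 mark occupied cells (walls, furniture).
    resolution:  meters per grid cell, as reported by the mapping run.
    wall_height: assumed height for the extruded walls, in meters.
    """
    boxes = []
    for row, col in zip(*np.where(grid > 0.5)):
        x, y = col * resolution, row * resolution   # cell origin in world coordinates
        boxes.append({
            "min": (x, y, 0.0),
            "max": (x + resolution, y + resolution, wall_height),
        })
    return boxes

def localize(agent_xy, resolution=0.05):
    """Map an agent's (x, y) position in meters back to a grid cell (row, col)."""
    return int(agent_xy[1] / resolution), int(agent_xy[0] / resolution)

# Example: a tiny 3x3 map with one occupied cell becomes one wall box.
demo_grid = np.zeros((3, 3))
demo_grid[1, 1] = 1.0
print(occupancy_grid_to_boxes(demo_grid))
```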
So, we have talked about where we are and how we got here. Where are we going? You've seen the Quest Stimulus Library with some Quest-created stimuli, and virtual environments for conducting experiments with both human and AI participants. And I also asked you to imagine with me, if you recall, different instances of a platform like Brain-Score for things like navigation and embodied intelligence.
Now I'm asking you to imagine, what if these things were components in a larger system? A system that allowed us to easily find, create, adapt virtual tasks and environments and stimuli, deploy these assets in new experiments for both biological and computational agents, make an apples-to-apples comparison of the internal activations and external behaviors of these agents, allow those insights to drive your model development, and also share them back.
So again, this is creating, sharing, discovering, and deploying stimuli tests and environments, conducting experiments, benchmarking models against neuro, behavioral, and ground truth data.
Benchmarking drives your model development, so your insights on how well your model compares to other models, to human behavior, to internal neural activations, and to ground truth data will actually inform how you build that model and what your next cycles will look like. And finally, when you have something that you are excited to share, you can actually link those results back to the original experimental assets.
So this is all very exciting hopefully just in and of itself because you can start seeing the possibilities of what we might be able to do if we had such a system. We could possibly learn more about the foundations of cognition and the development of intelligence. We could improve our understanding of the overlap and interplay between sensory modalities and the different facets of intelligence.
And very importantly, we can create integrated computational models that both serve as models of the natural system and accomplish that behavior not just in the narrowly-defined domains that we have right now, but with the generalizability and flexibility that we see in biology. So such a system could drive significant advancements in both AI and our understanding of general intelligence.
So that's us. These are the people who are trying to build that system. Given the scope, we are trying to expand our core team, but in the meantime, we are really lucky to have excellent collaborators across the Missions, across MIT, and in the external world who are helping us build it. So we are very honored and excited to be a part of building the science of intelligence.
[APPLAUSE]