Scaling Inference Mission
Date Posted:
December 5, 2022
Date Recorded:
November 4, 2022
Speaker(s):
Vikash Mansinghka - MIT BCS, MIT Quest
PRESENTER: I'm going to hand it to Vikash, who's going to talk about the scaling inference mission and all the ways that the platforms he and his team have been building have enabled not only what we're doing, but also, I think, many other things inside the quest.
VIKASH MANSINGHKA: So I'm going to be talking about a bet that we think can support all the other missions you've heard about today, which is centered in scaling inference for symbolic generative world models. And I'll try to give you a feeling for the building material that we're using that makes us think that now is a good time to make that bet-- so what's changed in the last couple of years. Then I'll share how we're going to measure progress using two moonshot projects that each have their own feedback loops between AI capabilities and models of natural intelligence.
And then I'll try to explain a little bit more deeply why we think we can actually scale, and how can we scale, not just at all, but with something like the efficiency of the mind and brain, using new platforms for representing models and performing inference-- specifically, the Gen probabilistic programming platform-- and a new approach for implementing probabilistic programs in biologically realistic models of spiking neurons that we're beginning to test as models in neuroscience. And then finally, I'll share a little bit about what we think the impact might be.
This mission has grown out of about 15 years of research towards two facets of human intelligence. The first, which you heard quite a bit about from Josh just now, is common sense, which almost all adults develop. And in fact, that development happens over the first 18 months of life.
The second facet is data-driven expertise, which I'm representing here through the lens of a project we did with a large philanthropy, a setting where many of us as adults learn expertise about a particular domain on the job, trying to help one another. But there are challenges in bringing our expertise and our opinions into deeper contact with empirical data. So our aim in this mission is to engineer AI systems that match or exceed human performance in both these domains, and to apply the scientific method to those systems in the way that you've heard throughout the day to identify which ones specifically best explain human intelligence at the levels of subjective experience, behavioral data, and neural data.
And I'd like to contrast the approach in this mission with what you've heard over the day as maybe the standard model of AI in industry, which is really large-scale, offline machine learning, where very powerful and impressive models are trained end-to-end, often requiring millions of dollars and years of processing for each model. Now one thing that's important to have in the back of your mind is that the data that's used to train these models in industry is often a mix of real data from the actual external world and data from high fidelity simulations that are resting on human knowledge about how the world works. So that's going to keep coming back later.
Now this approach is undeniably powerful. But I think it's worth reflecting on some of its limits. In this video here, I'm just showing the view from a Tesla-- its mental model of the world and what's happening on the road.
So somehow, it thinks traffic lights are flying towards it because it's driving behind a truck that happens to be carrying traffic lights. So that's, I think we can agree, a false detection at the least. But there are also well-known failures of detection for cars and humans, sometimes with fatal consequences, and sometimes in settings where people just put on a t-shirt printed with a texture that makes the human wearer invisible to perception systems.
And this brittleness persists despite literally exponentially growing dollar, energy, and compute investments and training on simulated data. And that has led to high profile failures at the industry level. So some of you may have heard recently that Argo AI, one of the more mature autonomous driving projects, was essentially shut down because they found it too hard to tell how much return they would see for another incremental unit of cost put into the machine learning approach.
So it's not that it had stopped improving. It's that they couldn't tell what order of magnitude of cost would be needed to get to the next level of autonomy. And Tesla's current and former leadership-- that's Andrej Karpathy, their former head of AI-- put this kind of problem colorfully, using this example of the weird corner cases in the world-- in this case, truck-on-truck-on-truck, which would be hard to put in a simulator.
So let's reflect again on the contrast with humans and nonhuman primates, who can learn to drive in just minutes. Now we may not trust their judgment, but we do trust their vision. And we trust our ability to assess their vision, and actually, also their judgment for the most part. So that reflects degrees of data efficiency, robustness, and explainability that really set a high bar for AI technology.
[LAUGHTER]
I encourage you to check out these videos afterwards. It's a great genre. Motivated by that gap, in this mission, we're focused on understanding how intelligence could produce such remarkably useful yet approximate models of the external world.
How is their structure somehow inferred from sense data? How is sense made of that sense data in terms of models whose structure and content are uncertain, and where the laws by which the structure evolves are often uncertain too? Intelligent systems can somehow perceive and think in ways that acknowledge all this uncertainty.
And this mission is really focused on first, the material from computer science that's needed to address this problem in a scalable, technically serious way. So computing has given us mature tools for simulating the external world, but not such a mature toolkit for modeling mental simulation engines that are learned. So let me show you a simple example that will hopefully let you see how it's possible to do this.
Here on the left, I'm showing data streaming in one data point at a time. And on the right, I'm showing the source code for a probabilistic program, a simple symbolic generative world model, that the machine is learning all online in real time as the data comes in. We're going to watch it a couple of times.
So initially, there's broad uncertainty about the structure. And the forecasts reflect that uncertainty. But after a few cycles, the system starts to converge on maybe a periodic overlay on top of some kind of linear trend. And you can see that, again, reflected in both the code, if you read closely, and actually in the forecasts.
At a very high level, I'm just trying to give you a feeling for how this building material is different than the building material in machine learning. So on the right, I'm showing a standard image of an artificial neural net, which has large entangled vectors. The learning is really about tuning parameters. And it has to happen offline, on massive amounts of data.
And on the left, you're seeing online inference to infer the structure of a probabilistic program from data, where the model is small, and abstract, and symbolic, and actually interpretable and editable by engineers. Its structure is learned from data. And it all happens in real time at human speed and human cost.
And this approach turns out, very recently, to have matured enough to be both faster and more accurate than industry machine learning systems on problems such as time series forecasting. So it turns out that the data that you saw was actually airline traffic volume prior to the COVID-19 pandemic. And the crash you're seeing was when flights were grounded.
And what you can see is the probabilistic program was able to detect the crash happened, and then make appropriately uncertain forecasts about the future, whereas the neural network time series forecasting systems make very nonsensical errors. The probabilistic program's also faster in terms of runtime. That'll come back later.
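To make the building material concrete, here is a minimal sketch in plain Python. This is not the Gen API, and the component library, priors, and inference method are all illustrative assumptions: a tiny space of symbolic model structures built from a linear trend and a periodic term, with a crude importance-sampling estimate of the posterior over which structure explains the data.

```python
# Minimal sketch (plain Python + NumPy, not the Gen API) of a symbolic
# generative model for a time series: structure = subset of {trend, periodic}.
# All priors and component forms are illustrative assumptions.
import numpy as np

def sample_structure(rng):
    # Structure prior: independently include a linear trend and a periodic term.
    return {"trend": rng.random() < 0.5, "periodic": rng.random() < 0.5}

def sample_params(structure, rng):
    params = {"noise": abs(rng.normal(0, 1.0))}
    if structure["trend"]:
        params["slope"] = rng.normal(0, 1.0)
    if structure["periodic"]:
        params["amp"] = abs(rng.normal(0, 2.0))
        params["period"] = rng.uniform(5, 50)
    return params

def mean_function(structure, params, t):
    mu = np.zeros_like(t, dtype=float)
    if structure["trend"]:
        mu += params["slope"] * t
    if structure["periodic"]:
        mu += params["amp"] * np.sin(2 * np.pi * t / params["period"])
    return mu

def log_likelihood(structure, params, t, y):
    mu = mean_function(structure, params, t)
    var = params["noise"] ** 2 + 1e-6
    return float(np.sum(-0.5 * ((y - mu) ** 2 / var + np.log(2 * np.pi * var))))

def structure_posterior(t, y, n_samples=2000, seed=0):
    # Crude posterior over structure: average the likelihood of prior parameter
    # samples within each structure to approximate its marginal likelihood.
    rng = np.random.default_rng(seed)
    scores = {}
    for _ in range(n_samples):
        s = sample_structure(rng)
        p = sample_params(s, rng)
        key = tuple(sorted(name for name, on in s.items() if on))
        scores.setdefault(key, []).append(log_likelihood(s, p, t, y))
    evidences = {k: np.logaddexp.reduce(v) - np.log(len(v)) for k, v in scores.items()}
    z = np.logaddexp.reduce(list(evidences.values()))
    return {k: float(np.exp(e - z)) for k, e in evidences.items()}

t = np.arange(100.0)
y = 0.3 * t + 2.0 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(1).normal(0, 0.5, 100)
print(structure_posterior(t, y))   # should favor the trend + periodic structure
```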
Why now? We really only became ready to pursue the scaling route in earnest in the last two years. I've been working on probabilistic programming and probabilistic hardware for more than 15.
But until about two years ago, I discouraged people from trying to use it. I said we'd be happy to have your help in building it, but if you want to try to use it, you should maybe wait. But recently, we've seen state-of-the-art results in terms of accuracy and performance in domains such as 3D scene perception, common sense data cleaning, and automated data modeling.
Now one indication that this is a real inflection point is that students are now coming to us because they've heard of and sometimes even used or contributed to our open source platforms. And they're writing papers in their first year that might have previously required their whole PhD. Another indication is that industry leaders, most recently Google, have started to fund this effort and contribute engineers to collaborate with us.
Now intellectually, what's going on here? Well, one view is that probabilistic programming enables us to draw on the really important ideas about learning from data that come from machine learning, but to integrate them with the idea of probabilistic inference and generative models and the powerful tools for scaling knowledge that come from the legacy of symbolic programs-- specifically, generality, compositionality, reflection, and compactness. And it's this synthesis of the symbolic, probabilistic, and neural that's just starting to scale now.
And that in turn hopefully gives you a feeling for why I think this might be an opportune moment to really try to scale up building agents that learn symbolic generative world models like the ones you saw in the development mission. Another pivotal example that may help to give you a sense of the inflection point is a recent success in 3D scene perception. So our 3DP3 system, built by Nishad Gothoskar and collaborators from IBM, and many others-- I think he's in the back here-- takes input images and infers the 3D scene structure that includes a symbolic scene graph of the objects, and their contacts, and the shapes of the objects. And that can be used to reconstruct clean depth images of what's out there in the world given the noisy sense data.
Now crucially, the symbolic object models aren't coded by hand. They're learned from just a few images. And the system is also appropriately uncertain. So for example, when learning a model of this mug here, the system knows that it doesn't know what's inside the mug because it hasn't seen inside the mug yet. So you can see how this uncertainty-aware scene representation might be useful for physical scene understanding in the embodied intelligence mission and for grounding the meaning of spatial words in natural language.
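As a rough illustration of what a symbolic scene graph could look like as a data structure, here is a minimal, hypothetical sketch in Python. It is far simpler than 3DP3's actual representation; the field names and the voxel-based shape model are assumptions made just for the example.

```python
# Minimal, hypothetical sketch of a symbolic scene graph: objects with poses,
# learned shape models, and contact relations between them. This is far
# simpler than 3DP3's actual representation; all field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str                 # e.g. "mug"
    pose: tuple               # (x, y, z, yaw) in the scene frame
    shape_voxels: set = field(default_factory=set)   # occupied voxels learned from a few views
    unseen_voxels: set = field(default_factory=set)  # regions never observed (e.g. inside the mug)

@dataclass
class Contact:
    parent: str               # supporting object
    child: str                # supported object
    face: str                 # e.g. "top"

@dataclass
class SceneGraph:
    objects: dict             # name -> ObjectNode
    contacts: list            # Contact relations forming the symbolic structure

# A table supporting a mug whose interior has never been observed, so the
# model explicitly represents that uncertainty rather than guessing.
table = ObjectNode("table", (0.0, 0.0, 0.0, 0.0), shape_voxels={(0, 0, 0), (1, 0, 0)})
mug = ObjectNode("mug", (0.0, 0.0, 1.0, 0.0), shape_voxels={(0, 0, 1)}, unseen_voxels={(0, 0, 2)})
scene = SceneGraph(objects={"table": table, "mug": mug},
                   contacts=[Contact(parent="table", child="mug", face="top")])
print(len(scene.objects), "objects,", len(scene.contacts), "contact relation")
```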
And it turns out this approach is actually more accurate than machine learning on databases of thousands of cluttered household scenes. And just to get a sense of what's driving these results, let's look at a few examples. So the top row here shows input images from that large database I flashed by on the previous slide. And the middle row shows outputs from DenseFusion, which is a strong deep learning baseline.
And the bottom row shows 3DP3, which detects and corrects many of the large nonsensical errors that are made by the deep learning system. Now we are currently about eight times slower than the deep learning system, but we're only two years in and just getting started scaling up.
And we've built a team over these years of collaboration that spans AI, computer science, the brain and cognitive sciences, and multiple industry partners, including a team at IBM that's supporting both the scaling inference and development of intelligence missions and a team at Google that's focused on the computer science problems involved in scaling inference and connecting it to economically relevant workloads.
So let me say a little bit about our mission goals, plans, and current prototypes. Our first project-- the name is ChiSight-- is to build a 3D scene perception system that is as robust, learnable, and efficient as the first 1,000 milliseconds of human perception. And we propose to test it against industry autonomous driving systems.
One strong test we'd like to do is to see if we can transfer its model from Boston to Bombay without hitting anyone or anything. As you can imagine, the data-centric simulation-heavy approach being pursued in Silicon Valley can't produce a technology that's globally deployable. Our industry partners are also excited about opportunities in video intelligence, and also about the possibility of just a cheaper scaling route for computer vision than their current artificial neural networks.
Now the way we're building it is to put these symbolic world models in a feedback loop with the image data, iterating to improve the match between the model and the data until we've sufficiently reduced uncertainty for the task or goal at hand, or determine that we don't know what's going on and should report some kind of error. And it's just worth looking at this in a little more detail because unlike deep learning, this approach allows our systems to know when it doesn't know. So here's a very early prototype from several years ago, where I'm showing depth images on the left and a simple geometric model in the middle.
That's just showing the viewpoint from a camera whose pose is being estimated along with the height of the room. And the system knows what it can explain. On the right, you can see that it can explain the blue pixels-- that is, the floor and the ceiling.
But it can't explain the yellow pixels for the chairs, and the tables, and the light fixture because those aren't in its model. So it knows that it can explain some of the data, but not all of it. And it's appropriately uncertain as a result.
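Here is a minimal sketch of that "knows what it can explain" idea, with assumed geometry and tolerances: render predicted depth from a very simple model, compare it to the observed depth, and flag pixels whose residual is too large as unexplained.

```python
# Sketch (illustrative thresholds and geometry): compare observed depth pixels
# to the depth predicted by a very simple room model, and mark pixels the model
# cannot explain, e.g. chairs and light fixtures that aren't in the model.
import numpy as np

def explained_mask(observed_depth, predicted_depth, tol=0.05):
    """True where the model accounts for the observation within tolerance (meters)."""
    residual = np.abs(observed_depth - predicted_depth)
    return residual < tol

rng = np.random.default_rng(0)
H, W = 4, 6
predicted = np.full((H, W), 3.0)            # model: flat surface 3 m away
observed = predicted + rng.normal(0, 0.01, (H, W))
observed[1:3, 2:4] = 1.2                    # an unmodeled chair much closer to the camera

mask = explained_mask(observed, predicted)
print("explained fraction:", mask.mean())   # < 1.0: the system knows it can't explain everything
```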
Now industry-- in this case, Meta-- has actually already furnished data sets that we can use to test the scaling route against industry neural nets-- for example, to infer the pose of very coarse 3D models of objects in a broad variety of scenes. And over the last five years, we've also been through some feedback loops with cognitive scientists and neuroscientists, especially in Josh Tenenbaum's group, who have helped us test earlier versions of the components and compare them to human perception. So you can learn a little bit more about this work and how to make these models neurally malleable at the poster session.
I want to share a little bit about our second moonshot in domain expertise, which aims to address limitations of data analysis that I hope we can agree are maybe widely felt, whether you're in government, or in business, or in science. As Churchill maybe colorfully put it, it's very easy for the assumptions in data analysis to be out of sync with the symbolic world models of human decision makers in ways that make it hard for those decision makers to use those results. And that phenomenon also arises in scientific data analysis.
So we've already shown through past work, and actually two past startups that have contributed back to the open source, that we can learn probabilistic programs that accurately model a broad range of databases. So here I'm showing clinical trial data from Takeda, genetic data from the Broad Institute, salary surveys of industry software engineers, and microdata from the US Census. In each of these plots, there's two colors, one of them showing real data and the other one showing synthetic data from a probabilistic program that we learned.
So I think the fits are pretty good. And that gives us an interesting basis to start building systems that can answer questions in English that are not just about the data, but about inferences about the symbolic model of the world that's behind that data. So let's look at an example that's drawn from our work to try to help create a more diverse and inclusive community in the probabilistic programming field.
So what you're going to see is a question being typed into our prototype saying, enumerate different genders and ethnicities. This is on the survey data for software engineers. And tell me what's the probability that they're underpaid?
And let's use the median, OK? And so then we're going to run that query. And the system is going to spit out a query in a probabilistic programming language that it can run against a learned world model.
But the human can say, oh, wait. Order by the highest probability, and then get a new query back. And at the bottom, there's a new command order by median probability underpaid.
And then that can be actually put into our prototype of ChiExpertise-- and there's a poster about more of the details-- and run. And what that will actually do is use the probabilistic program to score every developer in the database based on the probability that they're underpaid, roll it up by aggregate, and then sort, giving a very simple, intersectional view of salary equity in the software industry.
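To make the semantics of that query concrete, here is a rough Python sketch with made-up column names and a simple stand-in for the learned world model: estimate each respondent's probability of being underpaid as the fraction of posterior salary draws above their actual pay, then aggregate the median of that probability by gender and ethnicity and sort. This is only an illustration of the idea, not the ChiExpertise system or its query language.

```python
# Sketch of the query's semantics, with made-up column names and a stand-in
# for the learned probabilistic program: P(underpaid) per respondent, then the
# median of that probability grouped by gender and ethnicity.
import numpy as np

rng = np.random.default_rng(0)

def posterior_salary_samples(row, n=500):
    # Stand-in for the learned world model: a real model would condition on the
    # respondent's role, experience, location, etc. Here: noisy function of experience.
    mean = 70_000 + 8_000 * row["years_experience"]
    return rng.normal(mean, 15_000, size=n)

def prob_underpaid(row):
    samples = posterior_salary_samples(row)
    return float(np.mean(row["salary"] < samples))  # fraction of model draws above actual pay

respondents = [
    {"gender": "woman", "ethnicity": "A", "years_experience": 5, "salary": 95_000},
    {"gender": "woman", "ethnicity": "A", "years_experience": 7, "salary": 100_000},
    {"gender": "man",   "ethnicity": "B", "years_experience": 5, "salary": 118_000},
]

groups = {}
for row in respondents:
    groups.setdefault((row["gender"], row["ethnicity"]), []).append(prob_underpaid(row))

# "Order by median probability underpaid"
for key, probs in sorted(groups.items(), key=lambda kv: -np.median(kv[1])):
    print(key, round(float(np.median(probs)), 2))
```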
So I'm personally excited by the possibilities that this has both in business and also in journalism, and activism, and science, and other areas in civil society. The basic idea, of course, is to use current large neural models to translate from English to probabilistic programs, but use the probabilistic programs to give the reasoning that those AI systems are doing a coherent semantic meaning. And unlike previous generations of expert systems, which many of you may have heard of, here it's all learned from data or synthesized from natural language. And the inputs and outputs are editable by humans.
Now we also think this approach, especially in the time series setting, initially, can be compared fruitfully to human judgments about forecasting data, drawing on empirical work from Laura Schultz's group, and Josh Tenenbaum's group, and some others. But I won't say more about that right now.
So let's now look at the platform technology that's enabling this to scale. So this is maybe one view from the scaling inference mission on the bigger picture of the goals of the quest to understand intelligence in engineering terms. And on the left, I'm showing a version of the computing stack, from transistors at the bottom, as you heard from Jim, switching billions of times per second, that implement the organized processes supporting operating systems, programming languages, applications, and AI systems.
And on the right, what you're seeing is neurons that are spiking, maybe a million times slower and more efficiently than transistors are switching. But they're somehow able to learn to robustly produce symbolic models of all the worlds that humans encounter in their lives and can imagine, without being programmed to do so. So there's a big gap to close.
Now you've already seen one technical idea that helps us narrow these gaps, which is that we can learn probabilistic programs that represent the models from data. So we don't have to program them. But that leaves at least two more questions.
How can this approach possibly scale? And if it scales, how could it possibly scale with the level of efficiency of a brain and actually be implemented in and compared to a brain? Let me just say a little bit about this.
So the central idea of probabilistic programming is really to give models a new notation, code rather than math, and to separate modeling from inference, analogously to TensorFlow and PyTorch, but for a much, much broader class of models and inference algorithms. One way to appreciate the power of this abstraction: I remember when I was a grad student, when, like everybody else, we just knew that offline machine learning didn't work. And we were wrong.
But why were we wrong? Well, in my case, at least one big problem was that the gradient code that I wrote to do the deep learning was often buggy. So when TensorFlow came on the scene and it automated the math, many people, including myself, were able to succeed more consistently.
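Here is a minimal sketch of that separation in plain Python (again, not the Gen API, and the toy model is an assumption made for the example): the model exposes only a way to simulate its latent variable and score data, and a completely generic importance sampler does inference without ever looking inside the model's code.

```python
# Sketch of the modeling/inference separation (plain Python, not the Gen API):
# the model exposes only `simulate` and `log_likelihood`; the inference routine
# is generic and never looks inside the model's code.
import math, random

class LineModel:
    """Toy generative model: slope ~ Normal(0,1); y_i ~ Normal(slope*x_i, 1)."""
    def simulate(self):
        return random.gauss(0, 1)                    # sample the latent from its prior
    def log_likelihood(self, slope, xs, ys):
        return sum(-0.5 * (y - slope * x) ** 2 for x, y in zip(xs, ys))

def importance_posterior_mean(model, xs, ys, n=5000):
    """Generic self-normalized importance sampler over any model with this interface."""
    weights, values = [], []
    for _ in range(n):
        latent = model.simulate()                    # propose from the prior
        weights.append(math.exp(model.log_likelihood(latent, xs, ys)))
        values.append(latent)
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]   # roughly slope 2
print(importance_posterior_mean(LineModel(), xs, ys))
```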
Now probabilistic programming is starting to drive an analogous transition, but for online inference in symbolic generative world models, which are substantially more complex from a mathematical and computational perspective. And Gen is our main probabilistic programming platform. And to get a sense of where we are on the adoption curve, we're just at the point where we're learning periodically about new courses that are being taught in universities around the world, where people are using our online material to teach the concepts of modeling and inference and probabilistic programming using Gen without talking to us.
Now there's a small handful of courses like that that are happening. But it's three or four, not zero or one. And there are a number of industry partners who have also started contributing financially and in other ways to mature the platform.
And for people in the audience who have more of an engineering bent, I'll say another measure of Gen's maturity is that we've just now started to break ground on the Python and C++ versions of Gen that could be adopted by millions of engineers and deployed in production. And here I'm just showing the range of runtime performance that you get, from hand-coded C++ to our C++ version of Gen, all the way up to a Python version of Gen. And you can see there's some overhead as you move to more productive languages, but it's not too bad. So I think we're at the point where an open source community could really develop and enable a much broader audience to use familiar languages and also get real work done.
Now what about scaling? A key idea in scaling deep learning was to use stochastic gradient descent, which is actually a very simple probabilistic computing idea, where instead of exact gradients, one actually uses noisy estimates that are made from small samples of data. And this turned out to work better than exact gradient descent. It's part of the mystery of why everything seems to converge so smoothly.
It's because everything's a little noisy, and that noise smooths things out. And that opened up additional scales of parallelism. And maybe most surprisingly, people found empirically that this worked better the larger and deeper the model was.
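For concreteness, here is a tiny sketch of the stochastic-gradient idea on a least-squares problem, with illustrative batch size and learning rate: the gradient is estimated from a small random minibatch rather than the full dataset, and the noisy updates still converge.

```python
# Sketch of the stochastic-gradient idea: replace the exact gradient with a
# noisy estimate computed from a small random minibatch of the data.
import numpy as np

rng = np.random.default_rng(0)
N, true_w = 10_000, 3.0
x = rng.normal(size=N)
y = true_w * x + rng.normal(0, 0.1, size=N)

def grad_estimate(w, batch_size=32):
    idx = rng.integers(0, N, size=batch_size)        # small random sample of the data
    xb, yb = x[idx], y[idx]
    return np.mean(2 * (w * xb - yb) * xb)           # gradient of mean squared error on the batch

w, lr = 0.0, 0.05
for step in range(500):
    w -= lr * grad_estimate(w)                       # noisy but cheap update
print(w)   # close to 3.0, after touching ~16k data points instead of 500 * 10k
```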
Now it turns out that when your sequential Monte Carlo code doesn't have bugs-- and by the way, sequential Monte Carlo is a kind of analog of this in a certain sense: it uses an analog of a derivative, but for probability measures instead of functions-- it can scale analogously in all these ways.
And our partners at Google are working with us to map this out empirically because it has implications for hardware and software investments and how they search videos and time series. But I should say, this just explains how this could scale at all, where you poured in a lot of resources. This doesn't explain how the mind and brain actually do it so staggeringly efficiently.
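And here is a minimal sketch of sequential Monte Carlo itself, a bootstrap particle filter for a toy one-dimensional state-space model; the particle count and noise levels are illustrative. The key moves are the same as at scale: propagate hypotheses through the model's dynamics, weight them by how well they explain the incoming observation, and resample.

```python
# Sketch of sequential Monte Carlo (a bootstrap particle filter) for a toy
# 1-D state-space model: the latent position does a random walk, observations
# are noisy readings of it. Particle counts and noise levels are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, n_particles = 50, 1000
true_x = np.cumsum(rng.normal(0, 1.0, T))            # latent random walk
obs = true_x + rng.normal(0, 0.5, T)                 # noisy observations

particles = np.zeros(n_particles)
estimates = []
for t in range(T):
    particles = particles + rng.normal(0, 1.0, n_particles)      # propagate through the dynamics
    logw = -0.5 * ((obs[t] - particles) / 0.5) ** 2              # weight by observation likelihood
    w = np.exp(logw - logw.max()); w /= w.sum()
    estimates.append(float(np.sum(w * particles)))               # filtered mean estimate
    idx = rng.choice(n_particles, size=n_particles, p=w)         # resample
    particles = particles[idx]

print("mean abs error:", float(np.mean(np.abs(np.array(estimates) - true_x))))
```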
So here I just want to try to give you a flavor. And this is in some ways analogous to Josh's reinforcement learning slide earlier. So the naive scalable sequential Monte Carlo approaches I showed earlier are a very sophisticated, data-driven, but still far too random search. And that's not how perception and thinking work. That's way too inefficient.
So maybe if you're Google, you can take generic Monte Carlo techniques, shown in purple and blue here, and run them not for seconds but years until they work. But if you're one brain, you can't use the strategy that works for Google. So how do we get this green curve that scales much, much better?
Well, we actually have meta programs that are self-specializing modular inference algorithms that analyze the structure of the symbolic model and use that to break the inference problem down into small pieces, solve those pieces, and stitch the results back together. And I think our early measurements suggest that this can be thousands of times cheaper than machine learning or other generic techniques. And that's part of how we're going to scale.
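Here is a toy illustration of why exploiting model structure changes the scaling (this is not the actual meta-program machinery in Gen, just the arithmetic of the idea): when the model factorizes into independent pieces, solving each piece separately and stitching the answers together costs K * M evaluations instead of the M ** K that a naive joint search would need.

```python
# Sketch of the "break the problem into pieces" idea: if the model factorizes
# into K independent latent variables with M values each, per-variable
# inference costs K * M evaluations, while naive joint search costs M ** K.
# The model and its numbers are illustrative.
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, M = 5, 6                                          # 5 latent variables, 6 values each
true_z = rng.integers(0, M, size=K)
obs = true_z + rng.normal(0, 0.3, size=K)            # each variable has its own observation

def log_score(z):
    return float(-0.5 * np.sum((obs - np.asarray(z)) ** 2 / 0.09))

# Naive joint search: evaluates every combination (M ** K = 7,776 scores).
best_joint = max(itertools.product(range(M), repeat=K), key=log_score)

# Structure-aware inference: the score factorizes per variable, so solve each
# piece independently and stitch the answers together (K * M = 30 scores).
best_modular = tuple(
    max(range(M), key=lambda v, k=k: -0.5 * (obs[k] - v) ** 2 / 0.09)
    for k in range(K)
)

assert best_joint == best_modular
print("joint evaluations:", M ** K, "modular evaluations:", K * M)
```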
We've also been developing platforms for mapping probabilistic programs to biologically realistic models of spiking neurons and actually testing the predictions against data from multiple brain regions and model organisms. There's a poster on this approach outside. But the main thing I want to convey here is that I think it's really exciting that we can take probabilistic programs for modeling physical scene understanding in Mehrdad's lab, or concept learning from some of Josh's work, or for example, 3D prey tracking by larval zebra fish, which is one of the model organisms we're using in our own ChiSight behavioral experiments.
And in all those cases, we can actually automatically synthesize biologically realistic spiking Monte Carlo circuits that show how those probabilistic programs could be implemented in actual brains that spike at 10 hertz, let's say. And I won't go into the details in this talk, but I'll just say that it turns out these circuits predict diverse phenomena in spike and field electrophysiology, synaptic physiology, cortical hodology, different aspects of coding and dynamics.
So in this slide, I'm actually showing, on the left, one of the little modules for sequential Monte Carlo for one latent variable, and how it compares to a canonical cortical microcircuit. And here we're comparing simulated spiking from our model to really beautiful depth electrode recordings that Sakata did from rodent auditory cortex. And you see there's a laminar structure to dense and sparse spiking and relative timing that, it turns out, the model captures.
It also turns out that larger-scale spatiotemporal phenomena in neural activity fall out as predictions of spiking neural circuits implementing sequential Monte Carlo. So this is one that was a total surprise to me. But here what we're showing is that if you take one of our circuit models and measure what its gamma-band activity would be, you see gamma-band oscillations. And they turn out to be caused by local traveling waves driven at layer 4 in the microcircuit.
So there's a bunch of interesting predictions. And I encourage people to talk to Andrew Bolton, who'll be presenting this in the poster session. OK.
So hopefully now I've given you a sense of what our bet is, why now's a good time to do it, why I think we can succeed at it, how we're going to measure success, and what's the leverage that we have from platform technology. So now let me say a little bit about where I hope this will go and what I think it might mean. So again, let's step back and consider the big picture.
What would it mean if our platforms could let us really scale inference up in flexibility, to meet the enormous capacity and creative power of human intelligence, and down in cost, to match the efficiency of biological neurons? So I'll start with technology. So ChiSight could potentially make automatic 3D scene perception as ubiquitous as cameras. And there's significant economic value, and as Jim mentioned, serious risks to privacy and safety.
Personally, I think some of these risks might be better addressed by a university that's advocating for stringent regulation of this technology than by for-profit corporations, at least on their own. Now ChiExpertise, as technology, could potentially empower a large number of humans-- maybe ultimately, billions-- to use data and understand what it means. And we've been field testing with journalists, and activists, as well as economists and philanthropists over the years, and look forward to doing a lot more of that. And our open source platforms are beginning to be funded and contributed to by industry leaders in software, hardware, and semiconductors, as I mentioned earlier.
But I just want to end with where I think maybe the deepest impact could be, which is through shifts in science. So we're betting that we might actually have the tools now to build and test models of 3D scene perception that work computationally, cognitively in terms of accuracy and reaction time data, and neurally, going beyond accounts of the first 150 milliseconds to address how human learning is so data efficient and fast and produces models that are so robust. If we can do that, or even take a step towards that goal, I think it could help deepen the intellectual stack, the theory stack, over here.
So computing has scaled because there's many, many layers of conceptual and mathematical organization in-between transistors and applications. And there hasn't been that convergence vertically in the brain and cognitive sciences. The stack is unquestioned over in that building. But it's still being heatedly debated whether there is one, in some sense, or what it might look like over here. And I think the first second of perception with these tools might help us break new ground on that question.
Our work on data-driven expertise could help us maybe understand the limits of human rationality. Why is it that we can look at the same data and see such different worlds? And it could maybe give us tools to help us, as Fei-Fei Li put it in her great talk recently, to see aspects of the human world that we want to see but can't, and also aspects of the world that we, as humans, resist seeing clearly.
And personally, maybe finally, I'm just really excited that we can ask seriously, from a computational perspective, how is it even possible to have something as general purpose and creative as our faculties for perception and thought, produced by a 20-watt, 3-pound machine, especially when it's given the opportunity to develop in human community? Thank you.
[APPLAUSE]