Spatial Perception for Robots and Autonomous Vehicles: Certifiable Algorithms and Human-level Understanding
Date Posted:
April 23, 2020
Date Recorded:
April 21, 2020
Speaker(s):
Luca Carlone, MIT
All Captioned Videos Brains, Minds and Machines Seminar Series
Description:
Abstract:
Spatial perception has witnessed an unprecedented progress in the last decade. Robots are now able to detect objects and create large-scale maps of an unknown environment, which are crucial capabilities for navigation and manipulation. Despite these advances, both researchers and practitioners are well aware of the brittleness of current perception systems, and a large gap still separates robot and human perception. While many applications can afford occasional failures (e.g., AR/VR, domestic robotics) or can structure the environment to simplify perception (e.g., industrial robotics), safety-critical applications of robotics in the wild, ranging from self-driving vehicles to search & rescue, demand a new generation of algorithms.
This talk discusses two efforts targeted at bridging this gap. The first focuses on robustness. I present recent advances in the design of certifiable perception algorithms that are robust to extreme amounts of outliers and afford performance guarantees. I present fast certifiable algorithms for object pose estimation in 3D point clouds and RGB images: our algorithms are “hard to break” (e.g., are robust to 99% outliers) and succeed in localizing objects where an average human would fail. Moreover, they come with a “contract” that guarantees their input-output performance. I discuss the foundations of certifiable perception and motivate how these foundations can lead to safer systems, while circumventing the intrinsic computational intractability of typical perception problems.
The second effort targets high-level understanding. While humans are able to quickly grasp both geometric and semantic aspects of a scene, high-level scene understanding remains a challenge for robotics. I present our recent work on actionable hierarchical representations, 3D Dynamic Scene Graphs, and discuss their potential impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction. The creation of a Dynamic Scene Graph requires a variety of algorithms, ranging from model-based estimation to deep learning, and offers new opportunities for both researchers and practitioners.
Bio:
Luca Carlone is the Charles Stark Draper Assistant Professor in the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology, and a Principal Investigator in the Laboratory for Information & Decision Systems (LIDS). He received his PhD from the Polytechnic University of Turin in 2012. He joined LIDS as a postdoctoral associate (2015) and later as a Research Scientist (2016), after spending two years as a postdoctoral fellow at the Georgia Institute of Technology (2013-2015). His research interests include nonlinear estimation, numerical and distributed optimization, and probabilistic inference, applied to sensing, perception, and decision-making in single and multi-robot systems. His work includes seminal results on certifiably correct algorithms for localization and mapping, as well as approaches for visual-inertial navigation and distributed mapping. He is a recipient of the 2017 Transactions on Robotics King-Sun Fu Memorial Best Paper Award, the best paper award at WAFR’16, the best Student paper award at the 2018 Symposium on VLSI Circuits, and he was best paper finalist at RSS’15. At MIT, he teaches “Robotics: Science and Systems,” the introduction to robotics for MIT undergraduates, and he created the graduate-level course “Visual Navigation for Autonomous Vehicles”, which covers mathematical foundations and fast C++ implementations of spatial perception algorithms for drones and autonomous vehicles.
PRESENTER: This is the weekly CBMM seminar. And robotics is becoming of increasing interest for our Center for Brains, Minds and Machines, especially when it's connected, as in the research we'll hear about today, to human-level perception and understanding.
I'm happy to introduce Luca Carlone, who is a colleague at MIT. He is a professor in Aero and Astro and is also a PI in LIDS. And he's the head of the MIT SPARK Lab. And I'm sure he will tell us about it and what he's doing there.
I just want to add that he has Erdos number three, which is pretty amazing. But Luca, up to you. The screen is yours.
LUCA CARLONE: Thank you so much for the introduction, and thanks for inviting me. So I want to start by saying that, first of all, thank you guys for joining the seminar. I hope you and your loved ones are doing well during this difficult time.
And I also wanted to say that this is a true pleasure and a great honor for me to give this seminar for the Center for Brains, Minds and Machines. You know, many researchers in the center have been a true inspiration over the years for the research that we have been doing. And it's just a great pleasure to give this seminar.
So as Tommy was saying, I'm Luca Carlone. I'm assistant professor in Aero and Astro, and I'm a PI in LIDS. I'm going to talk today about spatial perception for robots and autonomous vehicles. And in particular, I am going to focus on this idea of certifiable algorithms and human level understanding.
So I will start by giving you a little bit of context. So I created the SPARK Lab at MIT. SPARK stands for Sensing, Perception, Autonomy, and Robot Kinetics. And the mission of the lab, at least on paper, is pretty simple. The mission is to develop theoretical understanding and practical algorithms to bridge the gap between human and robot perception for what concerns autonomous navigation. OK?
So I'm sure that in CBMM, everybody will be familiar with the word "perception." But just since we also have people joining this call from outside MIT, I will give a one-image explanation of what perception is.
So imagine that you have a robot exploring and observing the real world. The robot is using some onboard sensors to observe the real world. And perception is a set of algorithms and hardware processing the sensor data as well as prior knowledge to create an internal model of the external world.
So in general, you can think that perception is a broad umbrella for a number of challenges in signal processing, such as computer vision, state estimation (such as localization and mapping), probabilistic inference, and machine learning.
In this talk, I'm going to use often the word "spatial perception," just to highlight the fact that in SPARK we are mostly interested in understanding the 3D environment the robot is moving in, rather than just thinking about abstract reasoning. OK? So I will talk about spatial perception.
To understand the importance of spatial perception in robotics, I will give you just a simple example. Picture in your mind a situation in which you have a self-driving car trying to cross an intersection. You can imagine that, in order to cross the intersection safely, the self-driving car will have to realize what the lane boundaries and the obstacles are, figure out the presence of other vehicles, potentially track other vehicles, detect traffic lights, and potentially reason over the future intentions of other vehicles.
So these are all spatial perception problems that need to be solved in order to ensure safe autonomous navigation of a self-driving car. Of course, this example is about self-driving cars. But it turns out that you can extend the same idea to all sorts of robotics applications, and spatial perception becomes a key ingredient of robotics, ranging from domestic robotics applications like the Roomba robots, to factory automation like the Kiva and Amazon Robotics platforms, to infrastructure monitoring and inspection.
And even to applications beyond robotics, such as virtual and augmented reality, which you can see on the slide.
So perception is a key capability. And the interesting thing is that, over the last 10 years or so, we have seen a huge boost in performance here. We've seen really impressive progress on perception algorithms.
So we now have very compelling algorithms for localization and mapping, which you can see on the top left of the slide. This is an example from the DARPA Subterranean Challenge, in which our team, in collaboration with JPL, has been showing essentially real-time mapping capabilities on different platforms such as the Boston Dynamics Spot.
In the middle, you can see a very popular example of object detection with YOLO. And on the right, you can see an example of pixel-wise semantic segmentation of images, in which, for each pixel in the image, you have to assign a semantic class, essentially.
So there's been a huge amount of progress. And this of course has also enabled a number of applications. I already mentioned self-driving cars; a number of self-driving car companies, including Aptiv, essentially have been deploying platforms. And they've been fielding essentially self-driving cars in different locations.
And just to give one of the many examples in robotics, there are robotics platforms such as the Skydio R1 drone, in which essentially perception is truly enabling autonomous navigation.
So there are a number of opportunities which are really being enabled by new algorithms. At the same time, the broad adoption of these technologies has essentially revealed a number of fundamental limits of existing algorithms.
So this is an image that you probably have seen in the past. In March 2018, a self-driving Uber vehicle failed to detect a pedestrian in Arizona and essentially killed a woman because it failed to stop for the pedestrian. This is one example. Unfortunately, there are a few of them. This is an example on the right of a Tesla Model X in 2018 misclassifying a concrete obstacle on the left of the car and deciding to turn into the obstacle, killing the passenger in this case.
And I'm sure many of you have seen this example on the bottom right. This is of course a stop sign. In nominal conditions, a neural network is doing a good job at detecting and classifying this as a stop sign. But if you carefully place markers, these black and white markers, on the stop sign, you can essentially fool a neural network into thinking that this is a 45-mile-per-hour speed limit sign, which is a pretty bad mistake to make at an intersection.
So there are these catastrophic failures that are unfortunately going to have implications for human life. So in this presentation, I'm going to discuss a few ideas about how to analyze and boost robustness of perception systems.
And I will start by giving you the two takeaways of this presentation. So I want to convey in this presentation two basic messages. The first one is that in order to get performance guarantees, let's say a failure rate like the one that we look for in aerospace-- you know, a 1e-7 failure rate-- we need to rethink current perception algorithms. We are not going to get there with small incremental improvements over current perception algorithms.
And the second message is that we need a theory of robust spatial perception, which is essentially about showing how to connect robust algorithms into a robust system. So these two takeaway messages are going to guide essentially the outline of this presentation.
In the first part of the presentation, I'm going to talk about perception algorithms with performance guarantees. And in particular, I'm going to tell you about this idea of certifiable perception algorithms. I will tell you what I mean by that. And I'm going to give you two practical examples, being lidar-based object localization and image-based object localization.
Then in the second part of the presentation, I'm going to tell you about ongoing work on system-level guarantees and real-time high-level understanding. This is, you know, really ongoing work. It's a very recent publication. I'm very excited about it, and I'm very happy to share this with you as well.
So let me start with the first part, on certifiable perception algorithms. And before getting into the technical content of the talk, I will start with two disclaimers, just to set the expectations right for this talk.
The first is that the focus here is on certifiable algorithms, not on system certification. So as you know, a number of companies and standards bodies are essentially putting forward guidelines to certify safety, for example, in self-driving cars or flying robots.
In this talk, the focus is on certifiable algorithms. So we're going to discuss algorithms and how to get formal performance guarantees on the input-output behavior of these algorithms.
The second disclaimer is that this is not a deep learning talk, which is the typical talk you hear about perception. But the good news here is that there are plenty of opportunities for researchers who work in deep learning.
So what I'm going to say is not about deep learning, but can have very positive implications and provide opportunities for deep learning.
So let's start. I'm going to talk about certifiable perception algorithms. And the concepts I'm going to present are going to be fairly general. OK? However, instead of just presenting an abstract framework here, I'm going to use object detection and pose estimation as a running example. So I'm going to tailor the idea of certifiable algorithms to this setup of object detection and pose estimation.
So the problem is as follows. I give you an image, or rather, I give an image to my algorithm. And the algorithm is in charge of detecting a specific object in the image and also localizing the 3D pose of the object in the image.
In this case, of course, I want to localize where the car is, which for a human is a task that is fairly straightforward. So one potential approach to do this essentially works in two stages.
The first stage is about doing feature detection. So you take the image, and rather than working on every single pixel in the image, you extract features in the image. You can imagine that features, for example, are relevant points in the image: for example, the corners of the windshield, maybe the headlights, the wheels of the car. So just distinguishable parts of the image.
It's not surprising that in 2020, feature detection most of the time ends up being a neural network. So with a neural network, we do feature detection.
And while feature detection is the first stage, there is typically a second stage, which is about model fitting and estimation. Essentially, what happens is that I give you a 3D CAD model of the car I'm trying to detect, and you solve an estimation problem which is trying to fit the 3D pose of the car model such that the projection of this car model fits essentially the features that are extracted in the image. OK? It's a fairly standard framework. Feature detection first, and model fitting as a second step.
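As a concrete, purely illustrative sketch of that model-fitting stage: the quantity being minimized is the pixel distance between each projected CAD keypoint and the corresponding detected feature. The function names below are placeholders I chose for illustration, not the speaker's code.

```python
# Minimal sketch (assumptions: pinhole camera model, known 3D-2D correspondences).
import numpy as np

def project(K, R, t, P):
    """Pinhole projection of Nx3 model points P under pose (R, t) with intrinsics K."""
    Pc = (R @ P.T).T + t            # model points expressed in the camera frame, Nx3
    uv = (K @ Pc.T).T               # homogeneous pixel coordinates, Nx3
    return uv[:, :2] / uv[:, 2:3]   # normalize to pixel coordinates, Nx2

def reprojection_residuals(K, R, t, model_points, detected_pixels):
    """r_i = pixel distance between the projected 3D CAD keypoint i and the
    2D keypoint returned by the front-end detector (e.g. a neural network)."""
    return np.linalg.norm(project(K, R, t, model_points) - detected_pixels, axis=1)
```

In the idealized (outlier-free) case, running a nonlinear least-squares solver over the pose on these residuals recovers where the car is.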
So what I'm showing here is a pretty idealized setup, a pretty idealized view of this pipeline. The truth is that typically what happens is that you are given an image, you call your neural network to extract features in the image, and rather than extracting a bunch of relevant features on the car that you want to localize, you get a bunch of misdetections, which are the ones in red.
So again, you know, you have a number of inliers, which are things that you actually are able to detect on the car, and you have a number of outliers, which are points that ideally should belong to the car but instead are elsewhere. OK?
And you can see that it is also possible to have outliers for parts of the car which are misclassified. For example, one of these points in red might be classified as, you know, a wheel, or as a headlight. You know, that's an outlier as well. So outliers can be on the car itself.
So the first issue that you have in practice is that deep learning and feature detection in general can fail in really unexpected, unpredictable ways. And the second issue here is that, if you have a bunch of outliers in the feature detection, and I apply the model fitting as before, if I'm not careful about dealing with these outliers, essentially I get a completely incorrect estimate of the pose of the car.
So here, the yellow silhouette is where the algorithm thinks that the car is, which of course is incorrect, as you can just see.
So just to restate and reiterate: the second issue is that estimation may fail if there are too many outliers. OK? So we have two big issues.
On the other hand, this idea of trying to identify what the outliers are is also offering an opportunity here. The opportunity is that, if we're able to really identify the outliers, the inliers are essentially revealing how well the model is fitting the data.
In other words, if I'm able to detect inliers in this image in the middle, I can essentially understand how well my deep network is doing in identifying features, and potentially I can feed back that information to the deep network itself.
So there is value in essentially understanding inliers versus outliers. So there are these two big issues. Deep learning can fail to detect features and estimation can fail if there are too many outliers. How do we solve this situation? How do we solve object detection?
Well, the first line of thought is that maybe deep learning will solve it all. Maybe deep learning will get to a state in which there will be no outliers and we'll just have perfect detections of features. Well, that's definitely possible. However, if I look at the state of the art over the last two years of, in this case, vehicle detection and pose estimation using deep learning, the conclusion is that we have a long way to go there.
So here is just a single data point from a paper from 2019. And what you see in the figure is, for a number of baselines, the percentage of cases in which you have correct localization of a vehicle. And you can see that, essentially on the easy dataset, we get 80% accuracy here.
And as we go to the moderate and hard datasets, we get a success rate which drops to around 55%.
So you know, by just looking at this number-- it's a single data point, but you just realize that we are very, very far from the idea of getting a failure rate of 1e-7. OK?
And just by looking at the trend over the last few years, we can definitely get better. But it's not straightforward to conclude that, just by incremental work on neural networks, you're going to get a failure rate of 1e-7.
So this is just an observation about practical performance. If we talk about performance guarantees-- if we look at the theoretical side of things-- the situation is not much better.
If we look at work on neural network verification, we realize that there has been excellent progress in that direction, but it will take a while to converge to something that is usable for real-world systems.
In particular, if you look at neural network verification, there are limitations in terms of scalability, in how conservative the estimates are, and in the fact that the perturbations you typically assume in neural network verification are a pretty bad model for real images.
So with deep learning, the message here is that it is unlikely that incremental progress on deep learning will give us the performance we want. And that's why we said, let's take a completely different look at this.
So if we cannot make deep learning perfect here, if we are doomed to have outliers in the feature detection, can we make the second part of the pipeline ultra robust? Can we make the second part of the pipeline able to tolerate an extreme amount of outliers and still produce a good result?
So what I want to present next is essentially a new approach to model fitting which is capable of tolerating an extreme amount of outliers and noise. And, you know, I'm pretty excited about the results. In order to make sure that you guys are on the same page, I'm going to recall mathematically what this model fitting is. I'm trying to minimize the amount of mathematics in the slides.
But you know, I still have to explain what I'm talking about. So if you look at standard estimation, you can think about model fitting as an optimization problem. In the optimization problem, you're trying to figure out some state that you want to estimate-- for example, the 3D pose of the car.
And you are given a number of measurements, yi. The measurements, in this example, can be just pixel detections in the image. And you are given a residual error function, which is measuring essentially how well your estimate x is matching the detection yi.
And of course, you have many detections. You have many pixels that you detect in the image, and each pixel will contribute to the set of measurements.
This is like, you know, very standard framework. It goes back to the '80s. And of course, like with Tom in the audience, I must mention that here I'm showing a setup in which there are a bunch of measurements in the cost function.
Of course, you can regularize it. You can also put priors in the cost function itself. So this is a very standard framework. What happens is that, in the absence of outliers, you solve this optimization problem and you get a very good estimate of the x.
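For reference, the standard (non-robust) formulation described here is a least-squares problem; this is a reconstruction from the verbal description, so the notation may differ slightly from the slide:

```latex
\hat{x} \;=\; \arg\min_{x} \;\sum_{i=1}^{N} r(x, y_i)^2
```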
OK. Essentially, this basic formulation, which is a least-squares problem, is of course not robust to outliers. So if you solve this optimization problem querying for the pose x of the car, this is what the algorithm is going to tell you. Which, as humans, you of course understand is incorrect.
So the presence of outliers is a problem. And again, there is nothing new here. Standard estimation is not robust to outliers. What we can do is to just go back to traditional robust estimation and try to make this formulation robust to outliers. And one of the potential formulations for outlier-robust estimation is the one I'm showing here.
And I'm going to explain it in detail-- because you are going to see this equation at the bottom of the slides at least three or four times in the following slides.
So let's compare what's going on here. So this is standard estimation, right? You are minimizing the squares of the residual cost function. When you move to outlier robust estimation, we still estimate the x, which is the quantity you really want to estimate. For example, the pose of the car. But you also have binary variables theta.
You can understand that the thetas are variables that are essentially classifying the measurements into inliers or outliers. So theta can be equal to 0 and say that the measurement is an outlier. Or it can be equal to 1 and say that the measurement is an inlier. And the theta shows up in the cost function.
So you can see that if the theta is equal to 1, the measurement is considered an inlier. And this term appears because we have 1 minus theta. And essentially, the cost function becomes the same as the one in the standard estimation.
However, if theta is going to be equal to 0, this term disappears, and this term becomes equal to bar c squared, which is a constant. It's the maximum error that you are willing to tolerate. OK?
So one way to understand this is that you're trying to fit your model, and at the same time you're trying to classify measurements as inliers or outliers.
For those of you who are more familiar with optimization: the formulation at the top, the standard estimation, is doing least squares.
The one at the bottom is doing what is called truncated least squares. So essentially, there's a quadratic function which gets flat beyond the maximum error. OK? And this is enabled by the binary variable theta.
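Putting the verbal description into symbols (again a reconstruction, not a copy of the slide), the outlier-robust formulation is a truncated least-squares problem:

```latex
\hat{x},\,\hat{\theta} \;=\; \arg\min_{x,\;\theta_i \in \{0,1\}} \;\sum_{i=1}^{N}
\Big[\,\theta_i\, r(x, y_i)^2 \;+\; (1-\theta_i)\,\bar{c}^{\,2}\Big]
```

Minimizing over each binary variable separately gives min{ r(x, y_i)^2, c-bar^2 }, which is exactly the quadratic cost that flattens out beyond the maximum tolerated error c-bar.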
OK. This robust formulation is something pretty standard. And every one of you who has played with robust estimation at some point would say, OK, but maybe we just have to use RANSAC. RANSAC would be like a viable approach for robust estimation.
And RANSAC is going to solve a problem which is fairly close to the one I'm showing on the slide. It's not exactly the same, but in spirit it would be pretty much the same problem.
The issue with RANSAC is that if you do a very simple back-of-the-envelope calculation of the failure probability of RANSAC for an increasing percentage of outliers, even for the easiest problem-- minimum amount of points, minimum amount of data-- you get this kind of plot, which is showing the failure probability for an increasing number of outliers. And you can see that RANSAC essentially is doing well in a regime in which you have a small amount of outliers. So let's say 40%, 50%.
But you can also realize that if you go up to an extreme amount of outliers, let's say 70%, 75%, the probability of failure of RANSAC essentially gets close to 1.
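A minimal version of that back-of-the-envelope calculation (the specific numbers below are illustrative assumptions, not the exact curve on the slide):

```python
# If a fraction eps of the correspondences are outliers and RANSAC draws minimal
# samples of size s, a single random sample is all-inlier with probability
# (1 - eps)**s, so the probability that none of N samples is clean is:
def ransac_failure_probability(eps, s, N):
    return (1.0 - (1.0 - eps) ** s) ** N

print(ransac_failure_probability(eps=0.50, s=3, N=10_000))  # ~0: moderate-outlier regime
print(ransac_failure_probability(eps=0.99, s=3, N=10_000))  # ~0.99: extreme-outlier regime
```

The number of iterations needed to keep this failure probability low grows roughly like (1 - eps)^(-s), which explodes in the extreme-outlier regime discussed next.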
So essentially, RANSAC in this regime is very likely to fail. And the basic question is: can we fix that? Can we have an algorithm that is essentially able to work in these extreme conditions and is even able to detect when it's failing, when it's not able to produce a good estimate?
This brings us really to the core of this presentation, which is about giving a new perspective on robust estimation, which is this idea of certifiable algorithms.
So the way you have to think about it is that, in these estimation problems, we have a true state x, the circle. This is the unknown ground-truth state that you want to estimate. And we have taken some measurements, yi, of the true state. And we are feeding these measurements, yi, into this optimization problem, which is the robust estimation problem that I was showing in the previous slide.
And an algorithm would be solving this optimization problem to produce an estimate, x star. And ideally, of course, we want the estimate x star to be as close as possible to the ground truth.
So what is the idea of certifiable algorithms? Well, a certifiable algorithm is a fast-- meaning polynomial-time-- algorithm that is able to solve outlier rejection to optimality. It is able to solve this optimization problem to optimality in virtually all problem instances, or detect failure in worst-case or pathological problems.
Why do I mention pathological problems? Well, this problem is computationally intractable in general. There will always be, you know, some corner case that you will not be able to solve. But in every other case, you want to be able to solve outlier rejection to optimality and get an optimality certificate.
So the algorithm must say: I was indeed able to solve outlier rejection to optimality.
And then the second property of a certifiable algorithm is that you want what is called an estimation contract, which is saying that, under given conditions on the input, the algorithm is guaranteed to produce an output that is close to the ground truth quantity that you want to estimate.
So the algorithm must provide a contract saying that, if the data in input satisfies some properties-- let's say there is a minimum number of inliers-- then the estimate in output is close to the ground-truth quantity that you want to compute.
So these are really the properties in the formal definition of certifiable algorithms. I want to give you just another peek at what a certifiable algorithm is in the next slide, giving you a little bit of a more intuitive take. OK?
So the intuitive take is that we are taking RANSAC. RANSAC, as we said, has a probability of failure which goes up very quickly for a large number of outliers. And we want to convert RANSAC into a certifiable algorithm in which the probability of failure is pretty much flat. Of course there will be a few examples of failures, but the algorithm is able to detect when a failure occurs. OK?
So it's very different from RANSAC, in the sense that RANSAC will give an estimate anyway. We want an algorithm that is giving us an estimate and is either saying that the estimate is correct, or saying, I'm not sure about this estimate. You cannot trust it.
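That estimate-plus-certificate contract can be sketched as an interface like the one below. This is purely an illustration of the concept; it is not the API of TEASER++ or any other library, and all names are made up.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sketch: a certifiable solver returns an estimate together with
# a flag reporting whether global optimality was certified.
@dataclass
class CertifiableResult:
    estimate: Any        # e.g. the estimated object pose x*
    is_certified: bool   # True: solved to optimality (safe to trust);
                         # False: possibly a pathological instance (do not trust)

def solve_certifiable(measurements) -> CertifiableResult:
    """Placeholder: (1) solve a convex relaxation of the robust estimation problem,
    (2) check a posteriori whether the relaxation was tight (e.g. a rank-1 solution),
    (3) if tight, extract the optimal estimate; otherwise declare failure."""
    raise NotImplementedError  # see the relaxation sketch later in the talk
```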
So if you look at this image, you know, considering the times we are living in, I would say that essentially what we are trying to do is flatten the curve here. OK? We want to move from the failure rate of RANSAC to something that is able to push all the way to an extreme amount of outliers.
Another way to think about this is that, if you think about the whole domain of perception problems, which is this set in green, of course there will be some worst-case problems that you are not able to solve. We want a certifiable algorithm which is able to solve pretty much whatever is in the green set.
But if confronted with a problem instance which is in the red set, the certifiable algorithm must say, I'm not able to solve this problem, be careful. OK?
And I would argue that that's pretty much the same that happens for humans. If I give you a scene which is pretty easy to solve, you solve perception. You create an internal model of the scene, and you're pretty confident that's correct.
However, if I give a worst case instance, rather than trying to solve that worst case instance, you may just understand that you are not able to solve it and you may decide to stop the car instead. That's exactly what you want to do.
The positive news about this is that, by working on these problems, we actually realized that the set of worst-case problems is actually fairly small. So the algorithms I am going to discuss are really able to solve extreme problems with many, many outliers.
So that's the basic idea of certifiable algorithms. The question of course is, OK, that's the idea. Can we actually design a certifiable algorithm? Can we actually implement some algorithm that is certifiable?
And I'm going to show you a couple of examples. I will really minimize the mathematical side of things. We will go through two examples, lidar-based object localization and image-based object localization, just to show you that these ideas are actually implementable in practice.
So let's start with lidar-based object localization. In lidar-based object localization, I'm given a point cloud, which is the one detected by a lidar. And the goal is to find a known object in that point cloud.
So you can imagine that I have an object template. Let's say I have the cube in this case. I have a lidar point cloud. And the goal here is to localize: where is the cube in the lidar point cloud? Of course, in my visualization I was too lazy. I'm not showing points, but I'm showing just a scene here.
As I said, to solve this problem robustly, we should solve something like the optimization problem on the left here. And what I want to do in the rest of this slide is simply to tailor this general formulation to object localization in point clouds, which is something that is typically called 3D registration in computer vision and robotics.
So you can see that, if you compare the general formulation here with the formulation tailored to object detection in point clouds, these two are fairly similar. The only thing that is really changing is that instead of the general residual error, I am putting this specific residual error that I will explain in a second. And instead of estimating a generic state x, I'm estimating a rotation-- a 3D rotation.
So in this case, just to keep things simple, I'm trying to figure out the rotation which is aligning this model, this template, with the lidar point cloud. And you can see that this cost function is essentially measuring how well the rotation is aligning my model, which is a, with the scene I'm measuring with the lidar, which is b. OK? Just measuring the fit between these two.
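Written out, and reconstructed from the verbal description (the actual slide may also include noise weights and a translation), the robust rotation registration problem is:

```latex
\min_{R \in \mathrm{SO}(3),\;\theta_i \in \{0,1\}} \;\sum_{i=1}^{N}
\Big[\,\theta_i\,\| b_i - R\, a_i \|^2 \;+\; (1-\theta_i)\,\bar{c}^{\,2}\Big]
```

where a_i are points on the object template and b_i are the corresponding points measured by the lidar.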
So how do we solve this problem to optimality? Well, the one-slide overview of what we are doing is that we start with the robust problem, which is the one that I was showing in the previous slide. This is a tough problem to solve. It is nonconvex. It is combinatorial, because you have to decide on these binary variables.
And the first idea is to transform this, to manipulate the math, to formulate this as a quadratically constrained quadratic program, sometimes called a QCQP. It's just an optimization problem which has a quadratic cost function and quadratic constraints. And I want to remark that the two arrows here mean that these two problems are just equivalent to each other. So we just rewrote the math in a different way.
And indeed, QCQP are still nonconvex and hard to solve in general. But the nice thing about this is that, after we get a QCQP formulation, we can design-- it's very easy to design a convex relaxation, which is instead converting this hard optimization problem into something that is convex.
For those of you who don't know what convex means, you can think that convex just means this is a problem which is easy to solve. You have solvers that can just solve this kind of optimization problem.
So let me remark again that here there is a double arrow, in the sense that this problem is equivalent to the QCQP. But you can see that instead, between the second and the last box, there is just a single arrow, in the sense that, potentially, the convex relaxation is not exactly solving the QCQP.
I guess one of the key contributions that we are proposing is that we are designing convex relaxations which indeed, in most cases, are able to solve the original QCQP exactly.
Or in other words, we can have theorems which are allowing us to go back and to ensure that the convex relaxation is solving exactly the QCQP.
The theorem says something like this. It's a bit of math that is-- you know, it's not like super deep. It's saying that, if we solve the convex relaxation and the solution Z star of the convex relaxation is rank 1,
then Z star can be factored into x times x transpose, and x is the solution of the original problem that we started with, the QCQP. In other words, under this condition, in which Z star is rank 1, we're able to solve exactly an intractable problem, which is outlier rejection.
The additional observation here, which is pretty nice, is that, in practice, for the relaxations that we design, the solution of the convex relaxation pretty much always has rank 1, except in very rare worst-case instances. So in most cases the relaxation is essentially solving the problem exactly.
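As an illustration of the "relax, solve, check the rank" recipe just described (this is not TEASER++'s actual implementation; the matrices, the tolerance, and the use of the cvxpy package below are my assumptions), here is a generic sketch for a QCQP of the form min x^T Q x subject to x^T A_i x = b_i:

```python
import numpy as np
import cvxpy as cp

def solve_with_certificate(Q, constraint_As, constraint_bs, rank_tol=1e-6):
    """Solve the SDP relaxation of  min x^T Q x  s.t.  x^T A_i x = b_i,
    then check a posteriori whether the solution has rank 1 (the certificate)."""
    n = Q.shape[0]
    Z = cp.Variable((n, n), symmetric=True)             # Z stands in for x x^T
    cons = [Z >> 0]                                      # Z must be positive semidefinite
    cons += [cp.trace(A @ Z) == b for A, b in zip(constraint_As, constraint_bs)]
    prob = cp.Problem(cp.Minimize(cp.trace(Q @ Z)), cons)
    prob.solve()                                         # a convex SDP: polynomial time

    eigvals, eigvecs = np.linalg.eigh(Z.value)           # eigenvalues in ascending order
    is_rank_one = abs(eigvals[-2]) < rank_tol * eigvals[-1]
    x_hat = np.sqrt(eigvals[-1]) * eigvecs[:, -1]        # factor Z* ~= x x^T when rank 1
    return x_hat, is_rank_one                            # estimate + optimality certificate
```

The returned flag is exactly the optimality certificate the speaker describes: if it is true, the relaxation solved the original nonconvex problem; if not, the instance is reported as one the algorithm cannot certify.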
I also mentioned that the final algorithm must provide what is called an estimation contract, which is saying under which conditions on the input the algorithm produces a reasonable output. Of course, if I just throw garbage in as input, the algorithm has no way to figure out a reasonable output, so I should enforce some assumptions on the measurements in input to make sure that the estimate makes sense.
And I'm going to show in these slides the estimation contract for the algorithm that we designed for object detection in point clouds, which we call TEASER++. TEASER++ is the name of the algorithm, which has a nice acronym behind it.
So the types of estimation contracts that we have are fairly unique robust estimation theorems, like the one I'm showing on the slides. If the measurements contain at least three noiseless inliers, if the outliers are non-adversarial, and if the certificate of optimality holds-- meaning that, you know, the rank of the convex relaxation solution is 1-- then TEASER++ recovers the true pose of the object we are looking for.
The second one is even stronger. It says that, if you have adversarial outliers, we can still get pretty strong results if the number of noiseless inliers is larger than the number of outliers. So if essentially the number of inliers is greater than the number of outliers plus 3, and the certificate of optimality holds, TEASER++ is able to recover the true object pose.
So this, to the best of our knowledge, is the first time in which you can say, for a problem with outliers, under which conditions you can actually get a good estimate in the presence of outliers.
OK. Some of you maybe like the math. But the truth is that, at the end of the day, you want to see this working in real life. So let me show a number of exciting results, which I believe again are pretty unique-- we're very excited about them-- about how this algorithm called TEASER++ is working on real problems.
So what I'm going to consider here is a fairly standard dataset. And I'm going to use FPFH for feature detection. So these are not deep learning features; these are just handcrafted features.
And here, just as qualitative results, I'm going to show the result of detecting a cereal box, in this case, in a 3D point cloud. And, you know, these are quite typical results. You can see that essentially the algorithm is able to guess, in the right way, where the cereal box is, and is able to guess in the right way in the presence of a huge amount of outliers. So you can see that the outliers here are anything from 95% to 97%, which is a setup in which RANSAC would definitely struggle.
To convince you that this is true in general, I'm just providing the statistics. And here, I'm going to compare a number of approaches. And I'm going to evaluate the translation error, which is how well I'm able to localize the objects, for an increasing percentage of outliers.
So here, look at the scale. The scale goes from 95% outliers to 99% outliers. 99% outliers means that there is a tiny portion of inliers; everything else is garbage.
So here, I'm comparing TEASER and TEASER++, which are two variants of the approach that we are proposing, against the state-of-the-art fast global registration, and [INAUDIBLE], which is another state-of-the-art algorithm. And we are comparing two versions of RANSAC. One is called RANSAC 10k; it's doing 10,000 iterations. And the other one is RANSAC one minute. OK?
So what happens if you let RANSAC execute for an entire minute? Which is like an eternity for these kinds of problems. Well, it turns out-- this is an error, so a smaller error is better-- you can see that RANSAC with 10,000 iterations is starting to fail very early. RANSAC one minute is starting to fail at 99% outliers. And pretty much everything fails at 99% outliers, while the proposed approaches, which are called TEASER and TEASER++, still have a very good localization error. This is essentially 1-centimeter localization error.
To convince you guys that we can solve problems which are non-trivial, I will also show a simple example here. The Stanford Bunny is a popular benchmarking dataset. So this is the dataset. And here, I'm showing a downsampled version of the dataset plus 80% outliers. So I'm adding just 80% of random points around the bunny.
So as humans, we can stare at this image. But I'm guessing that a good number of you are essentially not able to understand where the bunny is in this scene. If I feed this scene into TEASER++, TEASER++ is able to detect the bunny and is able to find the set of inliers which is essentially supporting the idea that the bunny is there.
And remember that for TEASER++, this is a fairly easy set up. TEASER++ is able to work up to 99% of outliers. Here, I'm just showing a setup with 80% of outliers.
More experimental results: we have been testing this with all types of datasets. This is another dataset, called 3DMatch. It's about scan matching rather than object detection. And here, instead, we're using deep-learned correspondences.
I don't want to spend too much time on this. But essentially, in the table, you are comparing the success rate across different benchmarking scenes, comparing RANSAC, TEASER++ and a version of TEASER++ which again is discarding things that are not certifiable.
So without spending too much time on this table, you can see that there is a gap between RANSAC and TEASER++ which is around an 8% gap. So in 8% of the cases, RANSAC is failing completely, and TEASER++ is giving a good estimate, which is within 30 centimeters of the ground truth.
If you guys don't believe me, you can just go-- there is a very good open-source implementation of TEASER++ in C++ [INAUDIBLE]. It's just there. You can download it and test it on your problems.
So this was about object localization in point clouds, or lidar-based object localization. Let me move briefly to the second set up, which is image-based object localization.
And the nice thing here is that the setup is not very different. If you have to do certifiable object localization in an image, again, we start with the same general formulation of outlier rejection. And the only difference is that right now we are observing a 3D object in a 2D image. So essentially the math, the expression of the function r, will change a little bit.
I show that at the bottom of the slide-- I don't want you guys to read too much into the mathematics of this optimization problem. But essentially, again, the bottom line is that we are replacing the residual error ri with an expression which depends on the quantity we measure. So this is essentially saying that you measure some projection of the 3D object onto the image.
Again, not important to look at the math. Unfortunately, what happens is that in the previous part of the presentation I said, OK, we manipulate these expressions to obtain a quadratically constrained quadratic problem.
The issue is that, in this case, even if you manipulate the expression, you will not be able to get a quadratic optimization problem. You can see that there are products of variables which are essentially of higher order. They are not quadratic.
So what I'm going to show in this slide is a strict generalization of what I said in the previous case, which is actually a very general way to get provably optimal outlier rejection. And the basic idea is, again, to start with the general formulation of outlier rejection.
The difference with respect to before is that, instead of going for a QCQP-- instead of going for a quadratically constrained quadratic program-- we are going to go for a polynomial optimization problem, which is an optimization problem in which all the functions involved are polynomials.
And then what we can do is that we can still do a relaxation, but we are going to use a more powerful hammer. Instead of the relaxation I was talking about before, we are going to use something that is called the Lasserre hierarchy, which is a very general way to do convex relaxations of polynomial optimization problems.
The interesting thing here is that, while the mathematics is more general than what I mentioned before, the theoretical results are pretty much the same. So there is a theorem saying that if the solution of the convex relaxation is rank 1, then the optimal solution of the problem we started with can be computed from the solution of the convex relaxation.
So in other words, if the solution of the convex relaxation is rank 1, we will be able to solve outlier rejection to optimality. And there are a number of pieces of good news that I will go through quickly.
The first piece of good news is not my result. It's a more standard result about the Lasserre hierarchy, essentially saying that, under mild conditions, the hierarchy is tight when the relaxation order is high enough. The relaxation order is something that controls the size of the relaxation.
So essentially, Lasserre is saying that, if you increase the size of the convex relaxation enough, you always get something that is solving the original problem you started with.
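One standard way to write the order-kappa step of that hierarchy for an equality-constrained polynomial optimization problem (a reconstruction of the textbook formulation, not the slide's exact notation) is:

```latex
\min_{x}\; p(x)\ \ \text{s.t.}\ \ h_j(x)=0
\quad\longrightarrow\quad
\min_{y}\; L_y(p)\ \ \text{s.t.}\ \ y_0 = 1,\ \ M_\kappa(y)\succeq 0,\ \ L_y\big(h_j \cdot x^{\alpha}\big)=0,
```

where y collects the moments of x up to degree 2*kappa, L_y replaces each monomial by its moment, M_kappa(y) is the moment matrix, and the last constraint holds for all monomials x^alpha of suitable degree. If the optimal moment matrix has rank 1, its factor recovers the global minimizer, which is the certificate used here.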
And there are two other pieces of amazing news. The first is that we observe that the hierarchy produces a tight relaxation, a good relaxation, also at low order. So for a small size of the relaxation, we can already get the optimal solution. And the second is that there are a number of things we can do to speed up computation, which I will not have time to discuss in this presentation.
So that's the theory behind it. But the results are, I would say, pretty amazing here. I'm going to show experimental results for the FG3DCar dataset, containing around 300 images of cars, with a corresponding CAD model for each car.
And here, I'm starting with qualitative results, in which I show a number of baselines, like a CVPR [INAUDIBLE] baseline and RANSAC, comparing performance for the case in which you have 70% outliers. And you can see that, you know, this baseline is a more robust approach but is still [INAUDIBLE] the car.
RANSAC is getting a little bit closer but is still failing to detect the pose of the car. The proposed approach is spot on. We're able to detect where the car is in the image.
And I have a number of these kinds of results. Here I'm showing, for example, in the top part of the figure, what happens for 40% outliers, which is a pretty small amount. And I'm comparing three approaches.
There are a couple of robust, convex baseline approaches, and Shape* is the approach we propose. You can see that for 40% outliers, all these approaches are working reasonably well. They are doing well.
If we move to 70% outliers, you can see that the baseline approaches are struggling to detect where the car is in the image. And instead, the proposed approach is still spot on. It's still giving a good estimate of where the car is in the image.
And of course, there are plenty of statistics showing that, for an increasing amount of outliers, Shape*, which again is based on this idea of convex relaxation and the Lasserre hierarchy, is able to get a much better estimate of the location of the car.
In this case, I'm just showing the rotation error for the three techniques. And you can still see that, for an increasing number of outliers, the approach is staying around, you know, 2 or 3 degrees of error. So it's a pretty good estimate.
And we have a number of these figures; the results are pretty repeatable. I'm very excited about this, because essentially we have extreme performance in very, very difficult instances. And again, you can just look at the line at the bottom of this plot. For 70% outliers, the baseline approaches are not giving very good results, while Shape* is able to get the right location for the car.
So this concludes the first part of the presentation. The second part will be much shorter. I just want to throw out a number of ideas about getting system-level guarantees and real-time high-level understanding. And I'll explain why these two things-- high-level understanding and system-level guarantees-- are in the same title here.
So you can imagine that so far in the first part of the presentation, I told you a very simple message. I told you that robust perception requires that our model robustly fits the data and the priors we are given.
So if you're given an image, I just try to localize the car by fitting as much as I can my model to the data I'm provided with.
Well now, you know, the question is, is that enough? Right? So in other words, can we really certify image-based object detection? Can we guarantee that the image-based object detection is working well?
I would argue that it's not an easy task. It's not easy to do image-based-- to certify correct operation of image-based object detection. And to convince you of that, I will show you just a simple video.
So what happened here? This is a kind of popular optical illusion. At the beginning of the video, you are pretty sure about the location of the concrete posts on the street, right? But later, when there is a change in perspective, essentially you understand that your previous guess was wrong, and that instead the correct configuration of the scene was this one.
But frankly, there was no way-- from the original perspective, there was no way for you to figure out that your guess about the traffic post was incorrect.
So the basic message here is that truly robust perception requires reasoning in 3D. If I only do 2D perception, it is doomed to fail. There is no way you can resolve reality just by looking at an image.
Here is just a proof by contradiction, a simple proof. These are a number of images on which you can throw any algorithm you want for object detection, and the algorithm essentially will be very, very confused.
So in this case, what you are looking at is a car painted on a van. And of course, if you try to apply the algorithms that I discussed in the first part of the presentation to this problem, they will give a very nice estimate for the car, just because, you know, my model will perfectly fit the data there. It will perfectly fit the pixels.
And another example is this one-- you know, from a 2D image, there is no way I can distinguish reality from a picture of reality, or from a reflection of reality. So this is a misdetection of a car, which is instead an impostor. And this is a misdetection of pedestrians, which are instead, I think, a reflection in one case and a sticker on the car itself in the other.
So again, 2D perception is doomed to fail, and we must reason in 3D. But the question again is: is that enough? Is reasoning in 3D enough to make sure that we are confident in our world model?
So in the previous part of the presentation, I told you, OK, I have a 3D point cloud, and TEASER++ is able to correctly fit the bunny to this scene. And you trust me, right? You trust that essentially TEASER++ is doing the right job here at fitting the bunny to the point cloud.
However, if you look at this point cloud and you rotate it around, there is really no way to understand if the detection from TEASER++ is good or not. There is no way you can tell.
So in other words, if you consider objects in isolation, it is very tough to conclude on the correctness of our detection. Instead, if I play this video and I ask you, do you think that this detection is actually a car? Probably you are very confident that you understand the scene and you can also validate that the object that has been classified as a car is very likely to be a car.
So in other words, as humans, we don't take objects in isolation; we always reason in terms of, you know, the entire scene and the relations among objects. Which brings me to the other point: truly robust perception requires reasoning over spatio-temporal relations among objects, and thinking about the plausibility of the geometry, the semantics, and the physics of the scene. OK?
And you can see that, for example, in this picture, if you start reasoning in terms of the physical plausibility of the scene, you can conclude that it is impossible for a car here to essentially occupy the same space as the van.
So I'll spend pretty much the last three or four slides telling you about the work we've been doing to address reasoning in 3D and reasoning over spatio-temporal relations. I will just show a couple of videos. This is work in progress, but we're very, very excited about it.
So when thinking about reasoning in 3D, over the last two years, we have been trying to address this issue of trying to reason about objects and semantics directly in 3D. And we have been working on perception algorithms that can essentially model the geometry and the semantics directly in 3D.
So what I'm showing here is what we call Kimera, which is an algorithm for real-time 3D mapping and 3D understanding. And what you're looking at here is that Kimera, essentially our algorithm, is taking as input images and a 2D semantic segmentation of the images-- this can come from a deep neural network-- and it's producing a 3D model of the world in real time.
But you can see that this 3D model is essentially a 3D mesh, capturing the geometry of the scene and also capturing, through the colors, the different semantic labels of the scene.
For example, I don't know, yellow is desk, green is chair. So this essentially is a tool to reason at the same time over geometry and semantics.
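As a minimal illustration of that idea-- attaching 2D semantic labels to 3D geometry-- one could project the mesh vertices into each camera frame and accumulate label votes. This is not Kimera's actual implementation (which performs volumetric fusion and runs in real time); the sketch below, with names I made up, is purely illustrative.

```python
import numpy as np

def fuse_semantic_labels(points_3d, frames, num_classes):
    """points_3d: Nx3 mesh vertices (world frame);
    frames: iterable of (K, R, t, label_image), label_image holding integer class ids."""
    votes = np.zeros((len(points_3d), num_classes), dtype=int)
    for K, R, t, label_image in frames:
        pc = (R @ points_3d.T).T + t                 # vertices in the camera frame
        idx = np.flatnonzero(pc[:, 2] > 1e-6)        # keep points in front of the camera
        uvw = (K @ pc[idx].T).T                      # homogeneous pixel coordinates
        uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
        h, w = label_image.shape
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        votes[idx[ok], label_image[uv[ok, 1], uv[ok, 0]]] += 1   # accumulate label votes
    return votes.argmax(axis=1)                      # majority semantic label per vertex
```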
So the opportunity here is that Kimera is a first step in essentially thinking and doing reasoning directly on 3D over objects. In the other-- if you guys are interested in this line of research, there are clear opportunities that we use the source code for Kimera online. It's a very good and fast multi-threaded code and can get you at least kind of demonstration in no time. You can just execute it on just a bunch of images and inertial data, and you can get the 3D model which has both dramatic information and semantic information.
I'm a little bit short on time, so I will not get you like to the details of what's standing behind Kimera. It turns out that Kimera is running in a number of modules which are the state of the art of visual inertial navigation, 3D deconstruction. And everything is running in real time.
I will not have time to discuss the details of the architecture. This paper is already online. You can take a look both at the open source [INAUDIBLE] and the paper if you want.
So Kimera is providing a way to reason in 3D. But we said that we also want to reason over relations among objects. And this brings me to the last contribution I want to mention, which is this idea of 3D dynamic scene graphs. 3D dynamic scene graphs are just a way to abstract the environment-- to understand the environment at different levels of abstraction.
So here, what you see is that, at the lowest level of abstraction, we have the metric-semantic mesh that I was showing in the previous video. But besides this mesh, essentially, we also abstract the environment into multiple layers.
And in particular, we think about detecting objects and humans in the environment, detecting other layers such as places and structures-- thinking about structures as being walls and the ground here-- detecting the rooms, and detecting buildings.
So at the end of the day, a 3D dynamic scene graph is a directed graph where nodes are spatial concepts, and edges represent spatio-temporal relations between concepts. For example, I can say that an object is in a specific room, and [INAUDIBLE].
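A hypothetical sketch of such a layered graph as a data structure (the layer names, fields, and example entries are my own illustration, not the paper's exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    node_id: str
    layer: str                  # e.g. "object", "agent", "place", "room", "building"
    attributes: dict = field(default_factory=dict)   # pose, bounding box, semantic class, ...

@dataclass
class SceneEdge:
    source: str
    target: str
    relation: str               # e.g. "in", "adjacent_to", "traversable_to"

@dataclass
class DynamicSceneGraph:
    nodes: dict = field(default_factory=dict)        # node_id -> SceneNode
    edges: list = field(default_factory=list)

    def add_node(self, node: SceneNode):
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, relation: str):
        self.edges.append(SceneEdge(source, target, relation))

# Example relations: "chair_3 is in room_kitchen", "room_kitchen is part of building_1".
g = DynamicSceneGraph()
g.add_node(SceneNode("chair_3", "object", {"class": "chair"}))
g.add_node(SceneNode("room_kitchen", "room"))
g.add_node(SceneNode("building_1", "building"))
g.add_edge("chair_3", "room_kitchen", "in")
g.add_edge("room_kitchen", "building_1", "part_of")
```

Planning and high-level reasoning can then operate on the coarse layers (rooms, buildings) while the mesh layer retains the full geometry.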
This is work that we put online a few months ago. It's of course building on a number of excellent works. I want to feature just two works which have been really an inspiration for this line of research. The first one is the work from Iro Armeni, Jitendra Malik, and Silvio Savarese at Stanford on 3D scene graphs. We have essentially been extending this to capture dynamic elements in a [INAUDIBLE] robotic setup. But many of the ideas are really inspired by the 3D scene graph from Armeni.
And I wouldn't be surprised if what I'm saying is very natural for a number of you who have really been working a lot on perception over the last many years.
So for example, there is a clear connection between the work from Josh Tenenbaum on intuitive physics and essentially validating the plausibility of a scene-- even a reconstruction, a potential model of the scene.
So again, we have the opportunity here to use the 3D dynamic scene graph to reason over relations among multiple objects and conclude on plausibility, and to take a holistic view rather than thinking about objects in isolation.
And here is a little bit of a longer video showing how we get to build this dynamic scene graph. And you can imagine that this video is actually a combination of Kimera, which I discussed a few slides ago, as well as the certifiable algorithms for object detection, which I discussed in the first part of the slides.
And there is a lot of cleverness going on about modeling stuff in 3D, but without including humans, for example, as part of the 3D reconstruction. There is some cleverness about tracking humans over time as dense models, and also about reconstructing the location of objects in 3D.
So I will probably just give you a final view of the 3D dynamic scene graph. And again, there is a paper on arXiv, which we put there a couple of months ago, with all the technical details if you are curious. But you know, this is a representation which we believe is going to be very powerful for high-level reasoning, and also to validate essentially the correctness of our world understanding.
So this brings me, I believe, to pretty much the conclusion of the talk. And the talk really has a couple of messages here. The first one is that getting performance guarantees, which is having a failure rate less than 1e-7 for spatial perception, really requires rethinking current algorithms. And rather than pushing on the deep learning side of things, in this presentation we discussed the idea of certifiable algorithms to get robust estimation performance in the face of extreme amounts of outliers.
And the second point was that we need a theory on how to connect robust algorithms into a robust system. And I argued that 3D high-level understanding is really key to evaluate the plausibility of the scene and to essentially get true robustness.
So of course, you know, I want to wrap up here by saying that I'm presenting this work, and I'm very excited about it, but most of the work, of course, is essentially something that the students in the group have been pushing. I'm very proud of the students, whom I'm listing here on the slides. I want to thank them for putting in 200% of the effort, essentially, to make this vision possible. And I want to thank the sponsors for supporting this work, and I want to thank you guys for your attention.
I'll probably stop here. I will put up just some advertisement for a Reinforcement Learning Challenge which we are organizing in the next few weeks. But I will just leave it there in case you guys are interested. Thank you.
PRESENTER: Thank you, Luca. Let's see whether there are questions first of all from the panel. Josh? Gabriel?
JOSH TENENBAUM: Well, let's see. I mean, I think just first generally, I think it's really interesting what you're doing. Because I think the general approach to sort of integrative scene understanding that connects geometry across different levels-- objects, agents, places, physics. You know, it's very exciting and really fits together with lots of things going on at CBMM.
But I have a bunch of questions. I guess really mostly they come down to ways in which I think some of the intermediate representations that you're positing as part of building the integrative system-- and this includes in the first part of the talk on object detectors, as well as in the second part. I think some of these are things which maybe the brain has. And others I think I'm sort of skeptical of.
And it's not that-- I mean, I wouldn't say we-- we're still really trying to figure out, what are the brain's intermediate representations? We don't know, right?
But the idea of key points, for example-- like I'm kind of skeptical.
LUCA CARLONE: I cannot hear you very well, Josh. You dropped out after saying that the key points-- you're skeptical about key points as being a representation for the human brain. But I lost you after that.
JOSH TENENBAUM: Great. Yeah, so that's one. And, you know, partly I'm thinking of how I can-- you know, I can show you a novel object that you've never seen before. And you can see it in 3D. And nobody identified the key points. You know, I don't know. Like I just have so many objects around my office here, my home office.
You know, I don't know. Here's one of these funny power adaptors. Which if you haven't seen that before-- or like the first time you see one of these headphones.
And like I can form a model of that object from just a little bit of experience. Now I can do a reasonable job, if not perfect, of localizing that in new scenes. But where did I get the key points from? I don't know. The other part is more to do with the second part of the talk, on the SPIN system and Kimera. Where--
I looked at the paper, partly cued by-- I think you were giving this talk here, and the talk announcement, and I think Google Scholar told me I should check out your arXiv paper anyway. So I looked at the SPIN paper. Again, it looks really interesting. I can think of all sorts of ways to engage on that. And I hope we can do that.
But if I understood correctly, what you were doing there was you were using a semantic segmentation map, even like a ground truth one, and sort of projecting like ground truth semantic segmentation onto a 3D mesh, which I think is cool. But again, I mean, first of all, that's not trivial to compute. And second of all, I think there's lots of reasons to think that we don't do 2D semantic segmentation and then use that to get to 3D.
Rather to the extent that we do semantic segmentation at all, it's by doing first robust 3D perception and then understanding about object categories. There's lots of things when I look around again in my complex, cluttered office here. Like I don't know, what's that called? I don't even-- like words, semantic labels don't pop to mind. But I see the surfaces. I see the objects.
And here I'm partly inspired by the human ability-- in some of the work we do in CBMM, inspired by and collaborating with the infant researchers like Liz Spelke and with Tomer Ullman on the computational cognitive development side. Or, you know, other work that people study in animal perception, where you know-- even animals-- humans and other animals who don't have language, who don't have semantic labels for object categories, we still see the 3D world very, very robustly.
And so that's partly what makes me skeptical that there is an important 2D semantic segmentation layer on the way to 3D scene understanding.
So I guess I'm wondering your reaction to that. Do you think-- maybe the right answer is, no, actually we should expect to see those representations in the brain. We just haven't found them yet. Or maybe you think, well, no those are just kind of convenient intermediaries that we're going through, and we could go through others. Maybe there are others that are better intermediate representations. Or maybe we don't need intermediate representations.
Although I feel like part of what you're arguing for is the importance of some intermediate representations which are assembled into a systematic structure. And that robustness comes from the fact that it's not just a single black box and system, but that there are these different representations with meaningful relations between them. That's part of how you establish robustness. So anyway, I'd like to just hear how you're thinking about those issues.
LUCA CARLONE: There are many, many points. OK, so thanks for the extensive comments. First of all, I want to say that your group and your work have been an inspiration for all the work that we are doing on plausibility. Physics and physics simulations are an inspiration and something that fits very well with this idea.
The second thing I must say is that, as I mentioned, the second part of the work-- the one about Kimera, SPIN, and so on-- is a line of work which I'm very, very excited about, but I'm far from saying that it's solved. I'm far from saying that the only way to go is essentially to get the 2D segmentation and then back-project it to 3D.
I see it right now as kind of a convenient representation. Because you're moving, for example, from 2D to 3D, you can do a lot of Bayesian inference there, which allows you to do a somewhat more robust inference of what's going on in 3D. But I'm far from claiming that that's the only way, or that it's the way that we as humans do it.
I can tell you what's missing in Kimera, for example: we go from the geometry to the semantics, right? We build a geometric model and essentially we label it over time. But there is no true feedback going back from the semantics to the geometry. So--
GABRIEL KREIMAN: That's what I was going to ask about. So that's a good point. But keep going.
LUCA CARLONE: So when we started on this with my students, this was like the initial proof of concept, but it's definitely not the way we work as humans. As a human, if I look at the table in front of me, I'm capturing at the same time the geometry, the semantics, and even the physics, right?
And rather than geometry, semantics, and so on being a number of sequential modules [INAUDIBLE] over time, they are things that should help each other, so they get lighter in terms of computation and they get more robust.
So that is something that will need more work, and it's something that I'm very excited about, because I think that true robustness really must rely on that kind of redundancy among the semantics, the geometry, and the physics. That is where you get robustness.
JOSH TENENBAUM: That's definitely something to talk more about in the context of interactions around CBMM. Because again, our best understanding-- at least from, for example, Nancy Kanwisher's brain imaging, but a lot of other people's work as well-- is that in the human brain there are different systems, effectively, it seems, for geometry and for physics. I don't know about semantics exactly. Certainly there's language semantics, but there are probably more-- various kinds of semantic knowledge systems.
And so there is some modularity. And yet, we also think that, yeah, that human understanding comes from how these are coupled into some dynamic network. And our current neural network computational models, you know, are mostly feedforward. Or even if they're recurrent, they don't really grapple with how to integrate what seems to be going on in different subsystems of the brain for geometry, physics and semantics.
But the CBMM ambition is to do exactly that. So maybe we really have a lot of common cause in terms of what is the next step for us but also maybe what is the next step for you.
LUCA CARLONE: Oh, yeah. And also, I'm fully on board with the line of research on multimodal perception, in the sense that the fusion of information should not be limited to the visual channel. It should really be fusing all sources of information. So I think there is a lot of synergy in the vision, which is something I'm very happy about.
I think one high-level idea in the presentation, which I believe is true-- but of course I'm studying the robotics and machine side of things, not the human side-- is something that pops up in the first part of the presentation: this idea of essentially validating data against prior models of objects. You can argue whether the mathematical way we do that is the right way, or the way the brain does the computation, or not.
But this idea of comparing, in a robust way, the data against the model that we have in our mind is, at least, something that is key to robustness.
The other observation, which may be a bit controversial, is that for the line of work we are doing in the group, we are thinking about the human side of things as a huge source of inspiration. Because it creates like a proof of concept, right? We get these capabilities executed in a very easy and effortless way by humans.
But we are not postulating that the way humans do the computation is the optimal way or the most robust way. Something that probably you guys agree with is that humans are acting with a very limited power budget, limited computation, and so on, and are doing their best with the computation that is available.
If I throw as much computation at you as I can-- if I threw GPUs at you-- I don't see why you could not do better than that. So there is hope. Something that we keep in mind is that maybe with a different computational infrastructure, or a different type of intelligence, we might get better performance on some tasks. Not general intelligence, but maybe better object detection.
[INTERPOSING VOICES]
JOSH TENENBAUM: Yeah. That's a reasonable idea, and one that I don't think people here would be unfriendly to. I mean, we might think that if we really understand the brain in engineering terms, then we would also understand how it would work with more compute or less compute. Sometimes we're tired. Some other brains, like ours, are smaller. You could imagine bigger, better brains.
But you know, this issue of like what you do with limited power. You know, I guess I'm a little surprised to hear you say that. Because our colleagues in robotics, especially mobile robotics, emphasize that that's the problem that they face too.
So a couple of years ago, CBMM took a-- we had a big interesting conference out at GoogleX, where it was like a summit between Google and X and some DeepMind people and CBMM people. And we got to interact with the Waymo people at the time.
And they made the point that, you know, they have limited resources. They only have one LiDAR but lots of cameras. They also have-- compute is a limitation. Battery power is a limitation. Tesla, of course, makes this point. Computers-- and especially the batteries-- are heavy, right? And you have limited power. Or like Amnon Shashua, who's one of our advisors and a longtime student and colleague of Tommy's, makes the point of, well, what can you do on a chip in the context of mobile, lightweight platforms?
And you have quadcopter drones. Those have to be light. They can't have too much power.
So I mean, I think whether you're talking about an animal or a robot, but especially mobile robotics, especially lightweight mobile robotics like drones, I mean, I think you have to be thinking about limited power and what can you do on a budget?
And there again, biology might be an inspiration. I mean, I think many people point out that not only do we not have anything like robust scene understanding in machines that can compete with humans, even with all the GPUs we want-- not to mention the fact that the power consumption of our brain, relative to what seems to be the amount of compute, is already phenomenally better than any machine that anybody's built.
So anyway, I think there's a long way to go to still gain engineering insights from studying the brain. Before we just decide we're going to be superhuman. But yeah.
LUCA CARLONE: Let me [INAUDIBLE]. I actually end up thinking about power consumption a lot, like everybody in robotics, as you say. And maybe a couple of years ago, I used to think that chip design was the way to solve that. We designed chips for vision-based navigation and things like that.
And, you know, specialized hardware is definitely something that humans are leveraging a lot, I guess. Right? So very specialized circuitry for processing. I think that's definitely an interesting avenue.
But the other thing which-- again, to this audience, we're not--
JOSH TENENBAUM: It could be the algorithms, right? It's not necessarily the hardware. But--
LUCA CARLONE: Hardware, yeah. And the second point is really about the algorithms. And the one which, again, will not come as a surprise to this audience, is attention-- this idea of connecting, essentially, perception to the task that the robot has to execute, which is something that popped up.
Of course, there is excellent work from Tomaso, from you guys, from Antonio, on attention to save computation and so on.
But on the robotics side, I think there is a huge gap in connecting the task that you're given with the amount of processing that you have to do. I mean, there is a long way to go there.
It's something very interesting which is also on my to-do list, research-wise. It's a tough problem. So yeah, I completely--
PRESENTER: Let me interrupt. It's a very interesting discussion, but just to open it up a little bit, before getting Gabriel involved, let me ask Bernhard Egger in the audience. He has a more technical question.
AUDIENCE: So with RANSAC, usually the problem or the challenge is that if the outliers are somehow directed, and not just random noise, that it's getting very challenging. And I was wondering if your method is better coping with somewhat directed outliers.
LUCA CARLONE: Bernhard, thanks for the question. It's an excellent question. That is the setup that I was calling in the slides "adversarial outliers." If the outliers can really form structure which is designed to confuse the algorithm, that can really hurt RANSAC, for example.
And what we are showing in this work on [INAUDIBLE] algorithms is that these kinds of algorithms can do much, much better in the case of structured noise.
So you can have all the structured noise and outliers that you want. But as long as you have enough inliers to defend the correct hypothesis, the algorithm we are proposing will still be able to find the right structure.
And there is something that I did not mention in the slides, but it is quite interesting: if you have a point cloud and I ask you, go find this bunny in the point cloud, and in the point cloud there are two bunnies instead, the algorithm, instead of producing a rank-1 solution, will produce a rank-2 solution. And from the rank-2 solution you can get both hypotheses for the detection, which is something fairly unique.
So we are also able to get multiple detections of objects at the same time. So structured noise is a setup in which the algorithm will actually shine.
Of course, there is a setup with structured noise in which, if you allow more outliers than inliers, then the problem is [INAUDIBLE]. No algorithm can solve it, because the outliers will form the most likely hypothesis there.
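As a toy illustration of the "enough inliers to defend the correct hypothesis" point (and not the actual TEASER++ machinery), the sketch below estimates a single scalar offset from data that is 90% outliers: a plain least-squares average is dragged away, while a simple consensus/truncated-loss estimate still recovers the answer because the inliers agree with each other.

```python
import numpy as np

# Toy illustration only: estimate a scalar offset t from measurements
# where most samples are outliers. As long as the inliers agree with
# each other, a consensus estimate still recovers t.
rng = np.random.default_rng(1)
t_true = 4.0
inliers = t_true + 0.01 * rng.standard_normal(20)
outliers = rng.uniform(-100, 100, size=180)          # 90% outliers
y = np.concatenate([inliers, outliers])

eps = 0.1                                            # inlier threshold
# Consensus maximization: pick the candidate that the most measurements agree with.
scores = [(np.sum(np.abs(y - c) < eps), c) for c in y]
t_consensus = max(scores)[1]

print("least-squares mean :", y.mean())              # heavily influenced by outliers
print("consensus estimate :", t_consensus)           # agrees with the inlier cluster
```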
AUDIENCE: Thank you very much.
PRESENTER: The next question was from [INAUDIBLE].
LUCA CARLONE: Yeah, so the question from [INAUDIBLE] is: the optimization problem is subject to variables belonging to a non-Euclidean manifold, SE(3), which is the pose of the object.
How are optimization on manifolds and TEASER++ linked? That's a deep question. It's a very interesting question. The first thing I want to mention is that the structure of the manifold-- the way you describe the manifold in terms of constraints in the optimization problem-- is really what is enabling the relaxations that we do.
You can describe the rotation matrix in terms of quadratic constraints. And this is exactly the type of constraints that you're able to relax in the formulation.
So the fact that the sets we are dealing with-- rotations, poses-- are particularly nice-- you know, they are non-Euclidean, but they are particularly nice-- is really what makes this kind of relaxation unique.
For other problems, you would not expect a convex relaxation to work so well. But the math of these sets is particularly nice.
There is a connection with optimization on manifolds which we haven't been able to fully exploit. Sometimes you can do optimization on a manifold to solve some of the semidefinite relaxations arising here. So you can actually convert the relaxation into an optimization problem over a manifold.
And you can use optimization on manifolds to solve this very efficiently, which is something that we did in the past for other work on localization and mapping, but we haven't been able to do with TEASER++, just because the geometry of the manifold is a bit-- it's nice, but it's a bit more complicated.
So operations that you would do for optimization on manifolds, like retractions, become a little bit more expensive.
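To make the "quadratic constraints enable the relaxation" point concrete, here is a small, hedged sketch in Python (using cvxpy, which is not part of TEASER++): a rotation parameterized by a unit quaternion turns the constraint into the quadratic q^T q = 1, and the resulting quadratically constrained problem can be lifted to a semidefinite program. The matrix K is a random symmetric stand-in for the data matrix that would normally be built from point correspondences; the setup and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

# Stand-in for the 4x4 data matrix normally built from point correspondences
# (assumption for illustration only; any symmetric matrix works here).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
K = A + A.T

# Non-convex QCQP: maximize q^T K q subject to ||q||^2 = 1 (unit quaternion).
# Shor relaxation: lift Z = q q^T, keep the quadratic constraint as
# trace(Z) = 1 with Z positive semidefinite, and drop rank(Z) = 1.
Z = cp.Variable((4, 4), symmetric=True)
constraints = [cp.trace(Z) == 1, Z >> 0]
prob = cp.Problem(cp.Maximize(cp.trace(K @ Z)), constraints)
prob.solve()

# Certificate: if the optimal Z is (numerically) rank 1, the relaxation is
# exact and the optimal quaternion is the leading eigenvector of Z.
eigvals, eigvecs = np.linalg.eigh(Z.value)
print("eigenvalues of Z*:", np.round(eigvals, 6))
q_hat = eigvecs[:, -1]
print("recovered q:", np.round(q_hat, 4))
```

For this particular toy problem the relaxation is always tight; the notable empirical point in the certifiable-perception line of work is that tightness is also observed for the much harder robust-registration relaxations.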
PRESENTER: OK. So thank you. Let's go to Gabriel.
GABRIEL KREIMAN: So I was quite fascinated by your presentation. I think there's a lot of food for thinking further about this. I have a lot of technical questions, but I want to skip those to ask a more philosophical question here. I was particularly impressed by this notion of performance guarantees and estimation contracts. And I want you to help me formulate the question. The concern is, well, I'm always preoccupied, when we write our own algorithms, with this question of within distribution versus out of distribution, interpolation versus extrapolation. How well can you do?
And it wasn't very clear to me-- and I apologize, I need to catch up with the reading-- but it wasn't very clear from your work: if you trained to recognize the inliers and outliers on cars, can you still recognize trees and other objects and do object detection? All of these performance guarantees and estimation contracts-- are those circumscribed to a particular distribution?
You didn't talk much about distribution. So maybe you can help me formulate the question and the answer.
LUCA CARLONE: It's a good comment. I would say that, so first of all, on the estimation side of things, there is no learning. The way we solve the optimization problem does not assume any learning or training set.
What is producing the data-- which is the deep learning side of things-- can of course have a training distribution and so on, and can produce more or fewer outliers depending on whether it is in distribution or out of distribution.
So I think the opportunity here is that you can simplify and think of the approaches that we are proposing as just reliable ways to identify inliers in the data, being agnostic to what is producing the data-- agnostic to what is producing the inliers or outliers.
So the algorithm, without any requirement for learning, is telling you, these are-- most likely, these are inliers.
And the interesting thing is that you can use that information to understand whether your neural network is working in distribution or out of distribution.
For example, if you start seeing 90%, 99% of outliers, maybe your neural network is suffering too much. Maybe you're not using it properly, or you're using it out of distribution, essentially.
So we see that this approach, while not being learning-based, has the potential to feed information back to the deep learning side of things. It's something that we are exploring right now, but we don't have results on that yet.
But there is no distributional assumption here. Most of the statements-- all of the statements-- are deterministic. If you get a rank-1 solution, you are able to find the inliers. And if you have a number of inliers which is slightly larger than the number of outliers, you get the right solution. That's our result in a nutshell.
GABRIEL KREIMAN: And then very quickly, a more technical question which I didn't quite follow, what guarantees that you'll get a rank 1 solution? Or how do you know that z would be a rank 1? That seemed like magic to me.
LUCA CARLONE: That's-- it is magic. That's the most shocking part. It's something that many people familiar with optimization will get, but this is really the crux of the presentation: if you design this relaxation in a clever way, you get these rank-1 solutions-- at least you get the rank-1 solution except for very degenerate problems.
And I guess the best answer from my side is the theory of the Lasserre hierarchy. It's not my answer, but it's an answer from optimization theory. The Lasserre hierarchy, essentially, provides a general way to do convex relaxations of polynomial optimization problems. And there is a very powerful result saying that, if you make the relaxation large enough in size-- there is this thing called the order of the relaxation, and the larger the order, the larger the size of the problem-- there is going to be convergence of the relaxation to the solution of the problem that you relaxed.
So if you increase the size enough, for some finite size the convex relaxation will solve exactly the problem that you relaxed. And that's a very general result that goes back to the work on sums of squares and the work on moment relaxations from Lasserre, [INAUDIBLE] at MIT. There is deep [INAUDIBLE] on the optimization theory side of things.
So the theory behind that observation is already there. I think the interesting thing is that we see the relaxation being exact already at a very small size. If you need a very large relaxation, even if the theory says it exists, it's not useful, because you cannot solve it.
But what we are observing is that the relaxations are exact even while being fairly small in size, so you can solve them in a reasonable time.
That's a great question that will probably deserve a much longer answer.
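Schematically, and without claiming this is the exact formulation used in TEASER++, the pattern behind this answer is the following: a quadratically constrained problem is lifted by replacing $x x^\top$ with a positive semidefinite matrix $Z$, and a rank-1 optimal $Z$ certifies that the relaxation is exact.

$$
\min_{x}\; x^\top A x \;\;\text{s.t.}\;\; x^\top B_j x = b_j
\quad\Longrightarrow\quad
\min_{Z \succeq 0}\; \langle A, Z \rangle \;\;\text{s.t.}\;\; \langle B_j, Z \rangle = b_j,
$$

where the lifted problem drops the constraint $\operatorname{rank}(Z) = 1$. If the optimal $Z^\star$ happens to have rank 1, then $Z^\star = x^\star (x^\star)^\top$ and $x^\star$ solves the original non-convex problem exactly; the Lasserre moment hierarchy generalizes this to higher-order relaxations with convergence guarantees.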
GABRIEL KREIMAN: No, this is fantastic. I have another question, but I think there are other people that I see now in the Q&A. Maybe we should go through those.
PRESENTER: Yeah. Claudia, can she ask the questions?
LUCA CARLONE: Claudia, go ahead.
AUDIENCE: I'll take that. Thank you for your talk. It was really informative. So I have two questions in two different spaces. I don't know if we have time for both. The first one is about your statement that 3D vision-- or 3D perception-- is really needed.
So it's a very strong statement to say that you cannot solve those problems with 2D vision, right? And while I agree that definitely point clouds and 3D perception is key to some problems in robotics, I can argue that I can solve the problems you show with one eye, right?
So my question is about your thoughts on augmenting 2D vision with priors in other spaces. For example, knowledge that we have about objects and how they function.
LUCA CARLONE: It's an amazing question. And of course, that was kind of a strong observation, but one which I believe is true, actually. But [INAUDIBLE] too. I'm not saying that you cannot solve perception, or that you cannot get a world model, from a 2D image.
Of course, there is monocular depth estimation, and there are a number of contributions doing that.
What I am saying is that the monitoring, or the verification side of things, must be in 3D. Something that is a bit hidden in the presentation, but that I really believe, is that to be confident enough in the world model that the robot is reconstructing, you need a lot of redundancy.
And that redundancy is something that for sure comes from observing the scene from multiple angles. So again, you can solve problems from 2D images. But understanding verification and being confident, like enabling safety guarantees, really requires going beyond the 2D image.
And of course, I'm sure you have seen multiple optical illusions in which you are looking at an image and your prior is guessing a specific shape, like the one that I was showing in the slides, but when you change perspective, the 3D model that you had is not valid anymore. So that's just one example in which 2D perception will essentially mislead you.
I hope-- you know, it is a strong statement. But the angle I'm taking is from the verification and certification standpoint.
AUDIENCE: Got you. Thank you.
LUCA CARLONE: I don't know if you want to add, Claudia, with the question about the dynamic scene graph as well.
AUDIENCE: Yeah, I'm just respectful of the time. But that question is about human-robot interaction. I saw in your video that, in the dynamic scene graph, you are considering detection and tracking of people within these scenes and buildings, and it's one of the layers of the graph.
So I'm just thinking-- very curious what are your thoughts on what are very interesting applications of this graph for human robot interaction applications.
LUCA CARLONE: Yeah. I think there is a universe there. And it's something that of course, you guys are probably also exploring. I think there is really a universe on how you use this.
There is a universe behind really learning the structure of this graph instead of just building it. There is a universe of other potential uses of this graph-- for example, inferring the behavior of humans, or supporting human-robot interaction with it.
We are just starting to explore all the different avenues there, but there's really a universe. Things that we are looking into right now include, for example, planning at different levels of abstraction. Once I have a hierarchical model, if I have limited computation, maybe I want to plan at a different level of abstraction to save computation.
But there are really so many questions there, one being human-robot interaction, which are truly enabled by what the folks at Stanford have been doing on 3D scene graphs, and even more by the dynamic [INAUDIBLE] to humans.
You know, the basic thing that you can do with humans-- which is something that sponsors typically also care about-- is tracking the location of humans over time, projecting future rollouts of where humans could be considering their current trajectory, inferring types of behaviors, action recognition. You can do a number of things.
It's all an area to explore-- this is the kind of thing for which a single group would not be able to explore all the opportunities. I hope this is something that can become interesting for many researchers. There is a lot of work to do.
AUDIENCE: Yeah. I'm excited it could enable new representations for planning long term in the future, for multi-step interactions between one or multiple robots in real buildings. So yeah, very interesting. Thank you.
LUCA CARLONE: Thank you. Also, going back to the work from Josh, thinking about rolling out-- you know, essentially using what you have in Kimera as a kind of setup for a physics simulator, trying to roll out a simulation of the next few seconds to support decision-making.
I think that's something that would be pretty easy to do and would connect well to the work from Josh. And--
JOSH TENENBAUM: Let me just say-- look, I'd love to follow up on that.
LUCA CARLONE: Please. Please.
JOSH TENENBAUM: --as well as connections to all the other aspects of understanding people and multi-agent interaction. And I see the question that Marco has asked too. I think that's also a really interesting question.
I have to actually get off right now to go to another meeting that I'm a little bit late for. So I'll just say thanks-- from my point of view, thanks for a great talk, really interesting questions and discussion. And I hope we can interact a lot more in the near future.
LUCA CARLONE: Thank you so much for the comments. Take care.
PRESENTER: Very good. I think that's probably the appropriate time to conclude this. Thanks a lot, Luca. Thanks Gabriel. And thanks to Josh and Chris. So keep safe.
LUCA CARLONE: OK. So I will conclude-- I think Marco didn't have a chance to ask his question, but maybe he can follow up by email. I think something that I didn't mention, Marco, is that the deformable model is already in there. I didn't stress that part, but it's already in there.
So we can model a library of shapes rather than a single shape-- that's already there. But follow up, please, if you have questions.
The other thing that I want to say before wrapping up is just to thank you guys for staying here till 5:30, to thank Tommy for being an incredible host for this seminar, and just thanks for the great discussion.
PRESENTER: Thank you.
LUCA CARLONE: Thank you again.
PRESENTER: Goodbye. Bye-bye.
LUCA CARLONE: All right. See you soon.
PRESENTER: Keep safe.