The Convergence of Machine Learning and Artificial Intelligence Towards Enabling Autonomous Driving (1:15:30)
Date Posted:
March 25, 2017
Date Recorded:
March 24, 2017
CBMM Speaker(s):
Amnon Shashua
Description:
Amnon Shashua - Hebrew University, Co-founder, CTO and Chairman of Mobileye
Abstract: The field of transportation is undergoing a seismic change with the coming introduction of autonomous driving. The technologies required to enable computer-driven cars involve the latest cutting-edge artificial intelligence algorithms along three major thrusts: Sensing, Planning and Mapping. Dr. Shashua describes the challenges and the kind of machine learning algorithms involved, through the perspective of Mobileye's activity in this domain.
Biography: Prof. Amnon Shashua holds the Sachs chair in computer science at the Hebrew University of Jerusalem. His field of expertise is computer vision and machine learning. For his academic achievements, he received the Marr Prize Honorable Mention in 2001, the Kaye Innovation Award in 2004, and the Landau Award in exact sciences in 2005.
In 1999 Prof. Shashua co-founded Mobileye, an Israeli company developing a system-on-chip and computer vision algorithms for a driving assistance system, providing a full range of active safety features using a single camera. Today, approximately 10 million cars from 23 automobile manufacturers rely on Mobileye technology to make their vehicles safer to drive.
In 2010 Prof. Shashua co-founded OrCam, which harnesses the power of artificial vision to assist people who are visually impaired or blind. The OrCam MyEye device is unique in its ability to provide visual aid to hundreds of millions of people, through a discreet wearable platform. Within its wide-ranging scope of capabilities, OrCam's device can read most texts (both indoors and outdoors) and learn to recognize thousands of new items and faces.
TOMASO POGGIO: I'm Tomaso [INAUDIBLE]. I am welcoming all of you on behalf of the Center for Brains, Minds, and Machines, which is a center at MIT with other partners across the country-- the main one being Harvard. It is one of 14 large NSF centers in all areas of science and technology. This particular one is about the science and the engineering of intelligence.
We had our external advisory committee meeting today. This is a great set of people, friends, and leaders with great wisdom who gave us, and give us over the years, very useful advice.
And as some of you may remember last year for our meeting we heard from Demis Hassabis, also a member of our advisory committee, about AlphaGo. That was the first talk he gave after AlphaGo winning in Seoul against Lee Sedol, the kind of unofficial world champion of Go.
And today we'll hear from Amnon about Mobileye and autonomous driving, which I think is the top achievement of modern deep learning and AI so far. I'm glad and proud to introduce to you Amnon Shashua, founder of Mobileye, professor at the Hebrew University. I met Amnon for the first time when he arrived at MIT for his graduate studies. This was '89? At the AI lab. And he was a student of Shimon Ullman, and eventually mine during his PhD. He stayed with me for a postdoc.
My main claim to fame in terms of training him is that I got him interested in entrepreneurship. We went together on a business trip to Japan that led, I guess, to a failed startup. But this was followed by successful companies he's founded, CogniTens, Mobileye, and he is, of course, one of the most successful among my students and postdocs. And is also one of the greatest human beings I know.
It started with a master's thesis on saliency computation in vision. In his PhD he dealt with invariants to illumination. And then he wrote a paper, I think at the end of the fellowship in my lab, about multiple view geometry, which introduced a fundamental algebraic relationship between three views, known today as the trifocal tensor. I don't think I understand the mathematics even today, but that's a side comment.
And then this also found theoretical and practical applications ranging from 3D reconstruction, camera calibration, robotic navigation, and so on. He got several Best Paper awards, the Marr Prize in 2001, and so on, and so on. He was chairman of the School of Engineering and Computer Science at the Hebrew University from 2003 to 2005. In '95 he started CogniTens. And in 2000 he founded Mobileye. And in 2012 another company, OrCam, about which you may speak? Or no? OK. So the latter two-- OrCam and Mobileye-- are both partners of CBMM, and are the ones I'm most proud to speak about in terms of pioneering the technology of intelligence.
So Mobileye is an amazing case. I think it is the prototypical success of machine learning and computer vision in recent years because of the extent and the success of the underlying mixture of sophisticated theory and impressive technological applications. And I want to show you a little video-- you will see videos from Amnon today. This one was something we did in my group in '95. So that's 22 years ago. This was a project with Daimler-Benz. It was one of the first applications of computer vision and machine learning.
We trained something similar to Support Vector Machines with about 2,000 images of pedestrians. And then we ran this system-- there was a computer in the trunk of a Mercedes. This was done in [INAUDIBLE] Germany. And the system was able to detect pedestrians, making some false detections as well, as you can see. There is a traffic light that gets classified as a pedestrian in one frame. And so we had, at that time, an error rate of one error every three frames. And we were very happy about it. But this is 10 errors per second, right?
So now I think Mobileye has one error over, I don't know, 30,000 miles-- or something like that. And so this would mean, as a rough order of magnitude, about 1 million times better accuracy. So doubling accuracy every year. Doubling accuracy every year for 20 years gives about 1 million times. That's machine learning, you know, the progress of it. Amnon, your turn.
[APPLAUSE]
AMNON SHASHUA: Thank you, Tommy. I have my adapter here.
So you all received the flyer. We'll take care of it after-- during the Q&A session we'll talk about this flyer. So it was supposed to be an intimate lecture. I'm a bit overwhelmed by what Tommy did to me. So, you know, autonomous driving captures everyone's imagination. And there are a lot of aspects to it. There are policy aspects, engineering aspects. So let's do the following-- I'll give a lot of time for Q&A, since this is an area that people feel very strongly about and would like to ask lots of questions. So I'll talk for, say, 45 minutes-- 30 minutes or less-- and then we'll have lots of questions and answers. And I'll focus only on the engineering side. Now during the Q&A I'll also put on my executive hat, so you can ask questions about policy and so forth. But now I'll focus only on the engineering side, since we are here at MIT. That's really what matters-- engineering.
So I'll talk about the fundamental problems that underlie autonomous driving. So we're talking about machine learning, artificial intelligence. I'll focus on where exactly is AI hidden in all of this mix. Because people talk about AI almost about-- everything people talk about is AI. I'll be more specific. Where it is exactly hidden in the equation. I'll talk about different approaches. There's more than one approach to do things. I'll talk about the wrong approach and the right approach. And guess who does the right approach?
[LAUGHTER]
AMNON SHASHUA: OK? So let's begin. So in order to do autonomous driving there are three areas that we need to master. And I'm starting from the least complicated and then moving to the most complicated. So the least complicated is all about sensing. So for sensing, we have cameras-- say 360 degrees-- we have radars, we have laser scanners called Lidars. And we have high performance computing, very sophisticated silicon, that receives all this data. And then we have sophisticated algorithms that interpret the data. So interpreting the data is building an environmental model. We need to know where all the road users-- vehicles, pedestrians, cyclists-- are. We need to know where all the path delimiters are-- like curbs, barriers, guardrails-- where we can drive, where we cannot drive, the free space. We need to find, of course, the traffic lights, traffic signs.
And most complicated is the drivable paths. When we look at the road we have semantic meaning for everything. We know whether this lane mark is solid or fragmented, or whether it's a road edge. This lane leads to a highway exit. This lane and the one on the left are going to merge. There are pavement markings. We take all this information and we understand the semantics underlying it. So this is one area where artificial intelligence is hidden. So this is sensing. It is relatively well-defined, although there's more than one approach. And I'll focus on what the differences between those approaches are.
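To make the shape of such an environmental model concrete, here is a minimal Python sketch of the kinds of outputs a sensing stack along these lines might expose: road users with 3D boxes, free-space boundaries with semantic labels, and drivable paths with their "story". All the type and field names are hypothetical illustrations, not Mobileye's actual interfaces.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class DelimiterType(Enum):
    CURB = "curb"
    BARRIER = "barrier"
    GUARDRAIL = "guardrail"
    SOLID_LANE = "solid_lane"
    FRAGMENTED_LANE = "fragmented_lane"


@dataclass
class RoadUser:
    kind: str                      # "vehicle", "pedestrian", "cyclist"
    box_3d: Tuple[float, ...]      # 3D bounding box (x, y, z, length, width, height, yaw)
    velocity: Tuple[float, float]  # estimated ground-plane velocity


@dataclass
class FreeSpaceBoundary:
    # Polyline delimiting drivable free space, with a semantic label per segment.
    points: List[Tuple[float, float]]
    labels: List[DelimiterType]


@dataclass
class DrivablePath:
    # Centerline of one lane-level path plus its "story": where it leads.
    centerline: List[Tuple[float, float]]
    leads_to: str                  # e.g. "highway_exit", "merges_left", "continues"


@dataclass
class EnvironmentalModel:
    road_users: List[RoadUser] = field(default_factory=list)
    free_space: List[FreeSpaceBoundary] = field(default_factory=list)
    drivable_paths: List[DrivablePath] = field(default_factory=list)
    traffic_lights: List[Tuple[float, float, str]] = field(default_factory=list)  # (x, y, state)
```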
The second area is mapping. So mapping is not only technology, it's also a logistical problem. I'm not talking about navigation maps, I'm talking about very precise maps. Precise meaning you need to localize yourself in that map at an accuracy of 10 centimeters. GPS would not give us this accuracy. It could give us this accuracy when we're in open areas where differential GPS works. But when we are in urban settings, you cannot get a consistent 10 centimeter accuracy. So one thing is localization, 10 centimeter accuracy.
Second is the richness of information. We need to know where all the lane markings are, their semantic meaning, the drivable paths. Everything I talked about for sensing, just remove the road users-- the vehicles, the pedestrians. What is left is the building blocks of maps. And these are called high definition maps.
So it's not only a technological issue, how do you build these maps, it's a logistical issue. How do you build them in a way that is very low cost and scalable? It's not that you want to support only one or two cities-- you know, spend a lot of effort and map Mountain View. Big deal, right? We want to map the entire U.S. How do you scale up?
Second is how do you create a live map? Because if this map is going to be critical to support autonomous driving then it has to be always correct. Always correct meaning that if something changes in the environment we would expect that this change would be reflected in the map almost instantaneously-- near real time. So how do you do that? Because traditional map making is very, very time consuming, very laborious; lots and lots of manpower needs to be invested-- very, very costly. So it's also a logistical problem.
But why is it a logistical problem? Because-- and here I'm putting on an executive hat-- the idea is to build an economy, to build a business. Now if the cost of supporting autonomous driving is more than the cost of having a driver drive the car-- and these bloody maps could exactly make us reach that point-- then we will not have an economy. We'll have a nice science project, but it will not create a new economy. So this is a critical issue. How do you build these maps in a way that scales up?
The killer is the driving policy. What is driving policy? This is where most of the artificial intelligence is hidden. And this is largely an open problem. So if people tell you that autonomous driving is just around the corner, they don't know what they're talking about. This is really the Achilles heel of the entire industry.
So driving policy is all about negotiating. It's the reason we take driving lessons. We don't take driving lessons because we train our senses. We take driving lessons because we want to learn how to negotiate in dense traffic. And when you negotiate in dense traffic, the culture of negotiation is really location dependent. In Boston we drive very, very differently than in California. So it's location dependent. And we negotiate. We don't negotiate by talking to each other, we negotiate by motion. Our motions signal our intention to the other drivers. And there could be deadlocks. We want to do it in a way that is safe so we don't have accidents. And we want to be able to do it in a way that mimics human behavior, because if we are the only conservative vehicle on the road we're going to obstruct traffic. And if there are thousands like us, we're going to clog the entire city.
So the robotic cars need to drive like humans. They need to drive like humans, but on the other hand they need to be safe. They cannot drive recklessly. So this fine line between driving like a human on the one hand and driving safely on the other is really an open problem. And I'll focus on this in a bit.
So these are the three-- I call them pillars-- the three pillars that we need to handle. So sensing is very, very difficult. But this is the easiest among them all. Mapping is a big logistical problem. And driving policy is mostly an open problem.
So I'll show now four clips that kind of set the stage, and then I'll focus a bit more technical. So what I'm going to show here is a clip of how sensing looks like at the output level. So we'll look under the hood. There are eight cameras around the car, there are also radars and laser scanners. But I'm showing only the output of the visual sensing. So what you're going to see when I run this clip-- these are the 3D bounding boxes around cars. This green carpet signals the free space. You're going to see also traffic signs, traffic lights. And you're going to see this from multiple views. The lane markings, also pedestrians, traffic lights. A few pedestrians.
And at the edges of this free space being shown here, there's also semantic information-- whether this is a curb, a solid lane mark, a fragmented lane mark, a barrier, a guardrail, and so forth. So this is what sensing is about. It tells me where all the road users are, the path delimiters, and the drivable paths. I'll get back to sensing later and explain what is really difficult in this. There are three layers-- one is relatively straightforward, the other one is more complicated, and the third one is really a big issue. So this is sensing.
Mapping. So this is also a clip. What I'm going to show here-- the mapping is done in a crowd-sourced way. So this flyer becomes relevant. In 2018, we're going to have two million cars from Volkswagen and BMW generating data-- data that we generate in our chip inside these cars. What is this data? The data is about harvesting the lane information-- all these drivable paths that I mentioned before-- and landmarks: traffic signs, poles. There's a vocabulary of about 20,000 different items that we recognize. Pavement markings, signs, poles, reflectors, all sorts. Fixed things in the scene that the vehicle can use for localization. So the drivable paths are the building blocks of the high-definition map, and the landmarks are the building blocks for localization.
So what you're going to see here-- these are the lane marks, or the drivable paths. These circles are the landmarks. And you're seeing here two projections. This is a projection onto Google Earth. So this projection gives you a sense of accuracy-- say about 50 centimeter accuracy. If the projection of the map onto the scene shows the line aligned with the true lane mark, you know that we have here about 50 centimeter accuracy.
This is a projection onto the field of view, onto the image space, when the car is driving. This gives you a sense of accuracy at the level of centimeters. Because if this line is not sitting exactly on the lane mark, we're talking about centimeters of error.
So if we run this-- one moment-- OK. So you can see how accurate all of this is-- once you see the lane marks, you'll see how accurate this is. So we're talking here about an accuracy of a few centimeters, and all this map information is generated automatically. There is no manual intervention.
All cars which have a driving assist module-- which is a front facing camera with our processing chip inside-- are generating this kind of data. It's about 10 kilobytes per kilometer. It's sent to the cloud. In the cloud it is aggregated and the high definition map is built. And then when the car drives it goes and identifies these landmarks and uses the landmarks to localize itself within this map at an accuracy of at most 10 centimeters. On average it's even five or four centimeters.
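As a rough, toy illustration of the crowd-sourcing loop described here-- cars upload sparse landmark observations, the cloud aggregates them into a map, and a driving car localizes by aligning its current detections to that map-- the sketch below uses made-up numbers and a deliberately naive aggregation and alignment step; the real pipeline (and the ~10 KB/km format) is of course far more sophisticated.

```python
import numpy as np

def aggregate_landmarks(observations, cell=0.5):
    """Aggregate crowd-sourced landmark observations (x, y in a shared frame)
    into map landmarks by averaging observations that fall into the same
    grid cell. Real aggregation is far more involved; this is only a toy."""
    cells = {}
    for x, y in observations:
        key = (round(x / cell), round(y / cell))
        cells.setdefault(key, []).append((x, y))
    return [np.mean(pts, axis=0) for pts in cells.values()]

def localize(detected, map_landmarks):
    """Estimate the car's 2D offset by aligning currently detected landmarks
    to map landmarks (assumes a known one-to-one association, which in
    practice also has to be solved). Least-squares offset = mean difference."""
    detected = np.asarray(detected)
    map_landmarks = np.asarray(map_landmarks)
    return (map_landmarks - detected).mean(axis=0)  # (dx, dy) correction

# Toy usage: two cars observe the same two landmarks with small noise.
obs = [(10.02, 5.01), (9.98, 4.97), (50.1, 5.2), (49.9, 5.0)]
landmarks = aggregate_landmarks(obs)
offset = localize([(9.5, 5.0), (49.6, 5.1)], landmarks)
print(landmarks, offset)
```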
And then that is being used as redundancy for sensing. Why is that important? In order to guarantee safety we need to have redundancy in whatever we do. So when we go and detect road users like vehicles and pedestrians we have multiple sensors to get redundancy. We have cameras, we have radars, we have laser scanners. When we're talking about sensing the roadway-- sensing the drivable paths-- there's only one sensor that can do that. And that's the camera because it's texture based, it's not shape based. So a redundancy for the camera is the map. Without the map we don't have redundant information.
The map also provides us foresight. Once we know we are in the map, we can know where that path is leading beyond the range of sensing. Up to infinity basically. So this is critical to build this map. And this map is being built through crowd-sourcing.
Let me show you another-- this is in London. So that was in Las Vegas, this is in London. So all these are landmarks being detected. And this is a projection of the map data onto Google Earth and projection onto the driving scene. And you can see this is very, very-- it's highly accurate. So this is mapping.
So now, to put things together, I'm going to show you a clip. This was a demo that we built together with Delphi. Delphi is one of our partners; it's a Tier 1 supplier to the car industry. So we built together a vehicle that drove hands free in a complicated city environment, on about a six mile stretch of city and highway. And it did about 100 drives a day, day and night, for four days. So it's not just one drive where everything is carefully planned. It was 400 drives. And if something can go wrong it will go wrong. Right? And nothing-- it was really perfect.
So this is a reporter. It's a one minute clip of a reporter reporting about what he sees. And it kind of puts things together. Let's run this.
REPORTER: I have the latest version of Delphi's autonomous research vehicle. This is an Audi SQ5 that Delphi's fitted with radar and Lidar. And what's new for this generation is a camera system from partner Mobileye. That means nine cameras around the vehicle that give this car a better sense of its surroundings. During the drive on a set route this car acted very naturally. It was aggressive enough but safe enough; it felt like a human was behind the wheel. There's a display in the car that showed me what the car was seeing. I could see when it could see pedestrians, crosswalks, traffic lights. It really had a great sense of its surroundings.
One thing that really impressed me is while we were in a left-turn lane another car cut in front of us and the Delphi car behaved perfectly. Another time we also went through a fairly long tunnel; the car lost its GPS connection but still stayed on course. And one final thing that really impressed me is that this car uses crowd-sourcing to determine its path down the road. It sees the path that similarly equipped cars have taken before it. And so it follows that path as well as lane lines. Now this is still a research vehicle. But Delphi says this system could be ready for production around 2019, which means we could see it in a production car around 2020 or 2021. We'll see many more autonomous car technology demonstrations at CES. So stay tuned to Roadshow.
AMNON SHASHUA: OK so those four clips basically set up the stage. So now let's go into more detail. So let's start with-- I'm not going to spend more time about mapping. Although I think it's a fascinating field, but I want to leave time for Q&A. So I think you've got the idea of the problem of mapping. I'll focus on two areas, on sensing and on the driving policy. And this is where all the intelligence lies.
So we talk about sensing. Where is the AI hidden in sensing? So the three pillars inside sensing. The first one is really the obvious one, and it is the easiest one. You want to do object detection. So objects are all the road users, and traffic signs, and traffic lights. So detect vehicles, detect pedestrians, detect traffic signs, traffic lights.
An object is something that you can put a bounding box around. And this is the sweet spot of today's computer vision. Anything that you can put a bounding box around, computer vision is very, very good at. And in some cases, even better than human perception for this particular narrow task. And this is really an outgrowth of driving assist. This is what driving assist is all about. Driving assist is about preventing collisions. To prevent collisions you need to detect vehicles and pedestrians. You need to do it in a very high quality way. Today the false positive rate-- as Tommy mentioned-- the false positive rate of a system that does automatic braking on pedestrian detection is once every 30,000 hours of driving. So it happens maybe once in the lifetime of owning a car. So we're talking about something that is very, very high quality. But there was a long period of evolution, more than a decade, of bringing this to perfection. So this is the easiest problem to solve.
The second problem is when you are talking about free areas-- you want to find the free space. So you have objects and there is free space in between these objects. And the boundaries of the free space are the curbs, the barriers, the guardrails. So an image is an input, the output is a free-form boundary with semantic information along this boundary. So this is the first place where we're outside the comfort zone of classical computer vision. We need to explain something a bit more complicated. This is already in production in cars. For example the Tesla autopilot, the first generation auto pilot, has this kind of technology.
The third one is really the most difficult one: finding the drivable paths. So here the input is also an image, but the output is a story. It is not a free-form boundary, it is not a bounding box, it is a description of what I see in the scene in terms of the roadway-- in terms of the drivable paths. Which lane is leading where? What is the semantic information associated with the lane? It is a story. The challenge for a perception system here is much, much higher than in the previous two.
I call this strong perception. This is where most of the artificial intelligence that is left lies. And there's no system today that can do that. So this is an open problem.
So let's look at how sensing is utilized. There are two approaches. The approach on the left I call the map heavy approach, and the approach on the right I call the map light approach. And what I'm basically saying is that the approach on the left is the wrong approach. The approach on the right is the right approach.
So what is the map heavy approach? This is the classical way. Many demonstrations that you see out there are using this approach. So what is this approach about? Use a 3D sensor, use a laser scanner. It uses a Lidar to find the vehicles and pedestrians. And these are placed in the 3D coordinate system of the car because these are 3D sensors. So a laser scanner will give you a cloud of points in a 3D coordinate system.
Then what you do, you localize the car in the high definition map. So somebody built you a high definition map. The classical map-makers, with the traditional methods, they built you a high definition map of the area around you. You localize yourself in the high definition map using again the data from the laser scanner. You have a cloud of points from the laser scanner, the high definition map also includes a cloud of points. You do this matching. And you localize yourself in this high definition map.
Once you localize yourself in this high definition map you take the road users that you detected and you simply put them on the map. So now you have all that drivable paths from the map. You have the road users placed on this map, and you're done with.
If you'll recall the first Google vehicles out there-- they had a camera only to detect traffic lights. They needed nothing more than a laser scanner. So it's a very cool thing. You put your road users on the map, you localize yourself on the map. You take simply all the semantic information about the roadway from the map. You don't need sensing at all because somebody built you the map. How that someone built the map? You don't care. It's not your business. They put a lot of manpower and so forth, but it's not your business.
And then you inherit-- you have now a unified coordinate system with road users and semantic information about the roadway. And you simply control the car. There's the issue of how you control the car with the driving policy. But let's put that aside. We're talking about sensing. So this sounds like a very cool approach.
And now if you have other sensors, you want to enrich and make it more robust by adding other sensors. So other sensors would be cameras and radars. You need to make sure that those other sensors would be talking to you in the same coordinate system-- would be talking to you in a 3D coordinate system. Now this is hard. Going from 2D to 3D is hard. A camera is 2D, and radars are also 2D. It's a different kind of 2D, but it's a two dimensional piece of information. So taking other sensors and putting them in this coordinate system is a bit tricky.
What is the map light approach? The map light approach is much more difficult to do. And then I'll go into the pros and cons. The map light approach is to use cameras to detect the road users-- the vehicles and pedestrians-- and the roadway information simultaneously. Because the camera is the only sensor that sees them both in the same coordinate system, the 2D coordinate system.
And this is what I had in the previous slide. I found all the road users, the free space, the drivable paths using strong perception. I used a lot of computer vision and I put them all in a 2D coordinate system. We're not yet in 3D.
Now you localize yourself in this high definition map. So that was the movie I showed you before. You use landmarks, you localize yourself in a high definition map that you built using computer vision beforehand. Once you do that you can bring everything to a 3D coordinate system because the map is in a 3D coordinate system.
But now when you put things in a 3D coordinate system, if you have errors, the relative information remains the same, because the road users and the roadway were detected together in the 2D coordinate system. So if you have an error, it's going to be uniform. It's not going to be one error for the road users and a different error for the lane information. So the relative information remains intact. And this is critical. The relative information remains intact. And now you have a unified coordinate system-- a 3D coordinate system-- and you can then control the car.
Now if you want to add additional sensors like Lidars and radars, you need to do a 3D to 2D projection. And that is easy. Doing a 3D to 2D is easy. So here's an example of projecting laser scanner data into an image. This is an easy problem. Projecting radar data into an image. This is an easy problem.
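To see why the 3D-to-2D direction is the easy one, here is a small self-contained sketch of projecting a Lidar point into the image with a standard pinhole camera model. The intrinsics and the Lidar-to-camera transform are made-up illustrative values; the reverse direction, recovering a 3D point from a single pixel, is under-determined without extra assumptions such as a known ground plane, which is what makes 2D-to-3D hard.

```python
import numpy as np

# Made-up camera intrinsics: focal lengths and principal point, in pixels.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

# Made-up rigid transform from the Lidar frame to the camera frame.
R = np.eye(3)                       # rotation (identity for simplicity)
t = np.array([0.0, -1.2, 0.0])      # translation in meters

def project_lidar_point(p_lidar):
    """Project a 3D Lidar point into pixel coordinates (pinhole model)."""
    p_cam = R @ p_lidar + t          # 3D point expressed in the camera frame
    if p_cam[2] <= 0:                # behind the camera: not visible
        return None
    uvw = K @ p_cam                  # homogeneous image coordinates
    return uvw[:2] / uvw[2]          # perspective division -> (u, v) pixels

# Example: a point 20 m ahead and 2 m to the left of the camera axis.
print(project_lidar_point(np.array([-2.0, 1.0, 20.0])))
```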
So now let's look at the pros and cons. This is a clip showing how things look like in a top view. This is a 3D coordinate system because it's a top view just by looking at camera information and the high definition map.
So let's look at the pros and cons. The real advantage of the map heavy approach is that you can do rapid prototyping. That means if I want to take a team of engineers and within six months do an impressive demo, I can do that using the map heavy approach. I go and buy a high definition map and a laser scanner-- detect road users using the laser scanner-- and with a few months of work I can do something. Especially if all I want to do is a demo. And then I'm done. I don't need to do much more than that. I need to control the vehicle; I can do basic control. And all of a sudden, I can be on the news as a player in autonomous driving. OK? So this is one advantage.
If you look at the disadvantages of this approach, the high definition map becomes a single point of failure, because it all depends on this high definition map. There's no redundancy. The drivable paths and the road users are in different coordinate systems; they don't live together. So it creates errors when you put this together in 3D. And then the biggest problem is creating these high definition maps-- it is not scalable. There isn't a way to create them in a low cost manner, because I'm relying on somebody else to build me this high definition map. I'm not answering the questions from end to end. How do I make a system, including the creation of the high definition map, that will be very, very low cost? Because in the automotive industry, if something is not low cost it will not materialize. And this is something that players in this field sometimes do not realize.
So the advantages here: first of all, as I said before, the camera is the only sensor where you have both the road users and the roadway information in the same coordinate system. This is very critical. The creation of the high definition map and the localization use the same technology, the same computer vision technology, so I can crowd-source it. This is also very important. And also I can have low cost systems without a laser scanner. Now there is a lot of promise that laser scanners a decade from now will cost $200-$300. Today they cost many thousands of dollars. That is nice, but people forget that $300 in the automotive industry is hugely expensive. A camera module costs about $20. So we're still more than 10 times more expensive than a camera.
So if I want to make a living I need to make sure that I have an offering to give without laser scanners. And in the map light approach the laser scanner is simply another sensor to robustify my sensing-- to robustify my interpretation of the world. It's not a critical element in the entire thing.
And then the cons is this is very difficult to do. This is not something to do in a six months effort to do a demo. This requires real commitment. It's years of work because there's really strong perception going on here because I'm really solving things from end to end. I'm handling the map, handling the sensing, handling the projection onto 3D, and doing it in a way that is low cost. So these are the two approaches.
Let me go into the third pillar. So sensing and mapping-- we're not going to talk about this anymore. Let's go to the third pillar which is the driving policy. And this is where most of the intelligence is located. So sensing there is intelligence, as I said, locating objects is something that we know how to do today. It used to be a challenge 10 years ago-- five years ago-- today it's not much of a challenge. The drivable path is a big challenge. And this is where some intelligence is hidden. The biggest place where you need AI is in this area of driving policy. It's a negotiation.
And to show you that this is an open problem, this is something that was-- I took it from a year ago-- talking about autonomous test vehicles-- autonomous cars are really clogging traffic. They're driving too conservatively. We see that this has nothing to do really with Google, Google is a great company, but we saw this also with Uber vehicles. They had test fleets in Pittsburgh. And reporters would report back what they sense. And you clearly see that whenever something semi-complicated happens the driver behind the steering wheel needs to take over.
So let me show you two clips from my way to work. My way to work is in Jerusalem, and it's very, very similar to Boston. So driving in Jerusalem and driving in Boston are very similar. This is just to capture the complexity of this. So let's run this clip.
So if you look at this vehicle-- so it's squeezing itself in. This is the first thing. Now let's look at this one. This guy is not going to succeed. And at some point we're going to fast forward this clip. We're fast forwarding and this guy is not succeeding.
[LAUGHTER]
AMNON SHASHUA: So negotiation can fail. It doesn't mean that we're going to have an accident, it means that we can fail in our desires.
Let's look at this one here. This is a real challenge because it's long. Imagine the lengths of planning that's going on here. Because at some point we're going to fast forward this. Still this guy-- poor guy-- is working his way in. OK? This guy also is going to squeeze in, it's going to be difficult.
[LAUGHTER]
AMNON SHASHUA: So do you think that any autonomous car out there can do something like this? No way. OK?
Let's look at another example. Let's look now at a concrete example. The concrete example is called a double lane merge. So in a double lane merge you have, as you see here, vehicles coming in from this path and vehicles from this path. Vehicles can cross or can stay in their own path. And what makes it challenging is that there are no rules. The only rule is don't make accidents. So there could be deadlocks here, because you may be interfering with the plans of the other driver. It's not just squeezing in. So it's a complicated negotiation.
People normally say that the four way stop is the most challenging maneuver. But no, it's not a challenging maneuver, because there are rules. There is the right of way rule. So you need to respect the rule and squeeze in. So it's not that challenging. Here, because there are no rules, it's very, very, very challenging. And the classical way of doing this is opening up a tree of possibilities and traversing this tree. And this tree grows exponentially with the planning time and the number of agents around you. And you need to plan long, as you saw in the previous clip. So it's really a very time consuming, computationally intensive problem. And people find all sorts of [INAUDIBLE] for traversing this tree. And it's a big issue.
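A back-of-the-envelope way to see the blow-up: if each agent can choose one of A maneuvers per planning step and we plan T steps ahead over N interacting agents, a naive joint search tree has on the order of (A^N)^T leaves. The toy count below uses illustrative numbers only.

```python
def naive_tree_size(actions_per_agent, num_agents, horizon):
    """Number of leaves in a brute-force joint planning tree:
    (actions_per_agent ** num_agents) ** horizon."""
    branching = actions_per_agent ** num_agents
    return branching ** horizon

# Illustrative numbers: 5 maneuvers per agent, 4 agents, increasing horizons.
for horizon in (2, 5, 10):
    print(horizon, naive_tree_size(5, 4, horizon))
# Even these tiny numbers give 5**4 = 625 branches per step, i.e. roughly
# 9e27 leaves at a 10-step horizon -- hence the need for pruning, sampling,
# or learned policies instead of exhaustive search.
```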
So let's look-- show an example here. And I'll show you in this example-- So this is a double lane-- a double lane merge. We have a path here, path here, and cars can stay or change lanes. So this is just to give you a sense of this double lane merge. And in a moment we'll see one challenging.
So this guy, rather than starting here, started here, and then he's going to cross, upsetting everyone else while doing it. And the next one is not rare. This thing here-- let's focus on this. Let's see what's going to happen here. And this is the deadlock. What we don't hear is them cursing at each other or something.
[LAUGHTER]
AMNON SHASHUA: So it's a challenging problem. And we would like to use machine learning. Now normally everything is machine learning today. So what's the big message of "we would like to use machine learning"? Well, there is a big message here. And there is a reason why machine learning is hardly employed in this problem. To understand this, remember machine learning is a data driven approach. We say it's easy to collect data, rather than understanding the underlying causes behind the problem that we want to solve. So it works perfectly fine for pattern recognition, for natural language processing, for voice to text. There are many areas in which if we collect a lot of data and feed it into a black box-- a machine learning black box-- we solve the problem.
But what is the downside? The downside is that machine learning is based on the statistics of the data. So there could be corner events-- rare events. And in order to cover these rare events we need lots and lots of data. Because we're doing stochastic methods, stochastic gradient descent, we'll need to run through the data many times until we flush out all the rare events-- the corner events. So this is kind of the downside.
So when you talk about sensing, sensing is a classical area where you want to use machine learning, because you are sensing the present-- not planning for the future, sensing the present. The technology that you are using is deep supervised learning. And that's fine. When you are doing driving policy you're planning for the future. And that technology is reinforcement learning.
Now there is a big difference between these two. And the difference is the way we use the training set. So let's look at the case of sensing. When you're doing sensing, when you're using supervised learning, our actions are predictions. Our predictions do not affect the environment. Whether we make a correct prediction or a wrong prediction, we're not affecting the environment. This means that we can collect all our training data offline. We collect it once and we simply run through the training data again, and again, and again to flush out all the corner cases.
In reinforcement learning, or in driving policy, our action affects the environment: we decide to change lanes, we decide to accelerate, we decide to slow down-- we are affecting the environment. So if I change my driving policy I need to collect the data again. I cannot do a collection of data offline. Every time I change my driving policy I need to collect all the data again. And because the rare events in driving policy are the accidents-- these are the rare events-- I need to collect a lot of data in order to cover these rare events. Every time I change my driving policy I'll need to collect this data again, and again, and again. Therefore it's not an attractive proposition. And people do not use machine learning because of that.
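The supervised-versus-reinforcement distinction drawn here can be stated compactly: in supervised learning the input distribution is fixed, so data collected once stays valid; in on-policy reinforcement learning the data distribution depends on the policy itself, so changing the policy invalidates the old data. The toy rollout below (a deliberately trivial "environment" and two made-up policies) only illustrates that the states you visit, and therefore the data you collect, change when the policy changes.

```python
import random

def rollout(policy, steps=20):
    """Toy environment: the state is an integer position; the policy chooses
    a step of -1, 0, or +1. The states we visit (our 'data') depend on the
    policy that generated them."""
    state, visited = 0, []
    for _ in range(steps):
        state += policy(state)
        visited.append(state)
    return visited

cautious  = lambda s: random.choice([-1, 0])     # tends to stay near the start
assertive = lambda s: random.choice([0, 1, 1])   # pushes forward

random.seed(0)
data_old = rollout(cautious)    # collected once, offline, under the old policy
data_new = rollout(assertive)   # what the new policy actually encounters

# Supervised learning: labels for a fixed input distribution stay valid.
# Reinforcement learning: the new policy visits states the old data never saw,
# so the old dataset no longer reflects the problem being solved.
print("states covered by old data:", set(data_old))
print("states visited by new policy:", set(data_new))
print("unseen under old policy:", set(data_new) - set(data_old))
```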
So the challenge is how to guarantee safety with machine learning technology. And we solved this. We even have a paper about this. The paper is about separating out safety-- where you have a mathematical model of safety that guarantees that there are no accidents-- and leaving the machine learning only for the desires. Therefore the rare events are considerably reduced, because normally the rare events are the accidents. And you have a model that guarantees that you're not going to have accidents; therefore, you can focus only on the desires. And desires are not necessarily met, as we saw in the examples that I gave.
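A toy sketch of the "separate safety from desires" idea follows. It is only an illustration of the concept in a one-lane car-following setting with made-up numbers: a stand-in for a learned policy proposes an acceleration, and a hand-written safety layer overrides it whenever the proposal would violate a conservative following-distance rule. The actual formal model in the paper referenced above is considerably more careful than this.

```python
def safe_gap(v_ego, v_lead, response_time=0.5, max_brake=6.0):
    """Toy minimum following gap: distance closed during the response time
    plus the difference in stopping distances, plus a 2 m margin."""
    stop_ego  = v_ego ** 2 / (2 * max_brake)
    stop_lead = v_lead ** 2 / (2 * max_brake)
    return max(0.0, v_ego * response_time + stop_ego - stop_lead) + 2.0

def safety_filter(desired_accel, gap, v_ego, v_lead, dt=0.1):
    """Override the learned 'desire' whenever it would violate the safe gap
    at the next time step; safety is enforced by rule, not learned."""
    next_gap = gap + (v_lead - v_ego) * dt - 0.5 * desired_accel * dt ** 2
    next_v   = v_ego + desired_accel * dt
    if next_gap < safe_gap(next_v, v_lead):
        return -6.0          # brake: the hand-written safety envelope wins
    return desired_accel     # otherwise let the learned policy have its way

# Stand-in for a learned policy: always wants to close the gap aggressively.
learned_policy = lambda gap, v_ego, v_lead: 2.0   # m/s^2

gap, v_ego, v_lead = 15.0, 20.0, 15.0
a = safety_filter(learned_policy(gap, v_ego, v_lead), gap, v_ego, v_lead)
print("commanded acceleration:", a)
```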
So let me show you a simulation. So in this simulation we have eight cars randomly set. The red car should go right, the white car should go left. And I'm going to show three such initializations.
And you can see this is quite complicated maneuvering going on. We can measure the following statistics. We can measure what is the probability of an accident. Well it is built in a way that it should be zero probability of an accident. So if we have an accident it's a bug in the system because the model will guarantee it will not have an accident.
Then we can measure what is the percentage of success, because we know that we may not succeed in our maneuver. There could be cars that will not find their way in the right path. We would like a very high percentage of success.
Third, we can ask ourselves, what is the computing time of all of this. Because normally we're led to think that the computing time of something like this grows exponentially. Therefore, the computing time is going to be significant.
So when you run this, say, 100,000 times, we have 0% accidents. That means we don't have bugs. And out of the 100,000 runs only 200 failed. And when you look at these 200 you see that even a human would find them very, very difficult to accomplish, because the cars are placed randomly. And sometimes the cars are placed towards the end of the stretch and there's not enough time to do the maneuver. And in terms of the computing time, it takes about 1% of the computing time of sensing. So it's something that can be on the same chip, on the same processing platform.
This is another example. This is one slide before the last. So this is kind of pushing your way through. So all agents here-- all agents here have been trained with this driving policy. And you can see these are kind of complicated maneuvers.
OK, so if I summarize, again there are these three pillars-- sensing, mapping, and driving policy. To do sensing right you need to solve an AI problem which we call strong perception. This is about understanding image in, story out-- not image in, objects out; not image in, some kind of curve out. Image in, story out. This is strong perception.
Mapping done right-- you need to use this strong perception in order to build the building blocks of the high definition map, and you use perception in order to localize yourself in this map.
And then driving policy done right-- you need human level negotiation, but on the other hand, you need to guarantee safety. OK, so these go from easy to difficult.
So when we say that 2021 is the year when self-driving cars, from a technological perspective, will be on the road, it is because not all the pieces are there yet. We still need to work on pieces. I believe that we are not waiting for a scientific revolution; it's only a technological revolution that we need to have. And therefore a few years is something that is OK. If you are waiting for a scientific revolution, it could take 50 years-- who knows when a scientific revolution will hit us. But the building blocks are there; it's just putting them together.
So this is the end of the formal talk and we can open the floor for questions. We can start with the flyer maybe. So about privacy-- the issue of privacy in what I told you is about this crowd-sourcing. How do we create the maps? We use crowd-sourcing. And privacy is a big issue. And the way it is going to be done is that all the data is anonymized. Because it's not that we need to know where Joe and Moe have driven; we need the aggregated data in order to understand the patterns of driving, in order to build the maps. So this data will be anonymized. And if I compare this to the Facebook and Google that I carry in my pocket, we're talking about something which is much, much milder than what we have today in terms of privacy.
OK so we can start with Q&A.
[APPLAUSE]
PRESENTER: If you do have questions, please find one of the microphones.
TOMASO POGGIO: Yeah, if you can-- go ahead.
AUDIENCE: Hi, there. Thank you for the talk. That was wonderful. I was wondering what you thought the most pressing challenges in reinforcement learning are right now relating to the self-driving problem?
AMNON SHASHUA: So the biggest challenges of this driving policy-- we use reinforcement learning. Unlike what you may think reinforcement learning is a much more challenging problem to get good solutions than supervised learning. With supervised learning, when we build a deep network there are all sorts of magical things going on there. We can build a network that has much more parameters than the number of examples that we are using. They normally converge to a local minimum, which is a very good local minimum. And then of course, there is lots of research to try to understand why this is so. But from a practical point of view these networks work very, very well.
This is not the case of reinforcement learning. If you try to re-create papers using reinforcement learning to solve all sorts of interesting problems you'll be greatly disappointed. So there's lots of tuning to understand what works and what does not work with reinforcement learning. And reinforcement learning is really the bedrock of solving the driving policy.
AUDIENCE: Can I ask about in terms of 2021, I think a lot about unprotected left turns in the U.S., or say unprotected right turns in the UK or Japan. Do you think that's achievable in 2021? What are some of the challenges for situations that involve no traffic signals, vehicles coming at high speed, and sort of the negotiations with other drivers. Do you think we can do that?
AMNON SHASHUA: I think we can do that. It's all about guaranteeing safety. But I need to qualify this, that means for this industry to work we cannot assume zero accidents. There's no such thing as zero accidents. Zero accidents mean that we don't drive-- simply stay there, stay put and don't drive.
So I would compare it to the industry of airbags. So we all know that airbags save lives. What you may not know is that airbags also kill people. When they deploy at the wrong time, at the wrong speed, at the low speed, you hit the curb and all of a sudden the airbag deploys and breaks your neck. It happens. And it happens every year, you simply don't know about it. And society has learned to live with it because on one hand-- because the chance of something going wrong is infinitesimally small, and on the other hand it saves millions of lives. And society knows how to live with it.
The same thing could happen with autonomous driving, if one can show that the chance of something going wrong is infinitesimally small. So let's try to think about this. What does that mean? Take for example the U.S. They have about 35,000 fatalities every year due to accidents. If we can reduce this by three orders of magnitude, that's, let's say, 45 fatalities a year. And these 45 fatalities would be because something went wrong in an unprotected left turn or something like that.
AUDIENCE: Are those the kinds of numbers you're imagining in 2021, or that's hypothetical?
AMNON SHASHUA: This is hypothetical, I don't know now what we're going to reach.
[INTERPOSING VOICES]
AMNON SHASHUA: In order for society to accept it one will need to prove that you can get to that point. So in 2021 you're not going to see autonomous cars driving without a driver behind the steering wheel. It will take years of collecting data, making sure that these vehicles are safe, and answering this question-- what is the probability of an accident. And if one can answer this question that probability of an accident went down by three orders of magnitude, maybe 2 and 1/2-- maybe 100 fatalities a year could be acceptable to society-- then this could be acceptable. So it is not that the goal is to reach 0% accidents. The goal is to on one hand flow in traffic like humans yet have a model that guarantees safety. At least safety in the sense that it's not through your actions that an accident has been created. I mean if I drive and somebody hits me from the side, there's nothing I can do about it. So this is an answer to that question.
AUDIENCE: Thank you.
AUDIENCE: Hi, my name is Sean Jane, I'm a student here studying computer vision. Thanks for the talk. I was wondering how does Mobileye protect its IP given that this is such a hot field, and employees are poached. Or employees leave and form their own companies.
AMNON SHASHUA: OK, that's a very good question. So we do all the things that the companies do. We have patents, we have trade secrets. But we have something that you don't have here in the U.S. We work in Israel. Israel is a bit different, people are much more loyal to their organization.
[LAUGHTER]
AMNON SHASHUA: We've never had an employee leave and move to another company that's competing with us. And we are already 18 years in business. We have never had a knowledge leakage, which you find a lot in Silicon Valley-- people move from company to company, from company to competing companies and so forth. So I think being in Israel was really a blessing. Now that we will be parting from it, it will be my challenge how to preserve this. This is how we protect ourselves.
[LAUGHTER]
AUDIENCE: Hi, I think in your examples you mostly were describing situations where the cars don't talk to each other, especially in driving policy. How much of a difference would it make-- either some lightweight discussion, communication between the cars? And what percentage-- does it have to be 100%? Or do you get significant benefit if some percentage of the cars can communicate with each other?
AMNON SHASHUA: OK, so you're talking about what is known in the industry as V to V-- vehicle to vehicle communication. And vehicle to vehicle communication is a good idea. What it can give you is the ability to detect an object which you don't have a line of sight to. Because for an object that you have a line of sight to, you have sensing information. Now is this a necessary condition for autonomous driving, to detect vehicles that you don't have a line of sight to? The answer is no. Because if the answer was yes we would all stop doing any activity in autonomous driving, because ubiquitous V to V communication-- that all cars have the ability to communicate and send their precise location-- is not going to happen for the next two or three decades. Once it starts it'll take a long, long time until all cars have this V to V communication. So I think V to V is a good idea, but it is kind of orthogonal to the activity of autonomous driving.
Humans can drive without having this superhuman capability of detecting other road users without a line of sight. And if we are not driving drunk, and if we are responsible, we have a very, very small percentage of accidents. And we would like robotic cars to achieve the same level of performance. And it is possible because we have a proof of concept, and that is humans.
AUDIENCE: Hi, thank you, Amnon. Thank you very much for the talk. It's very helpful for this audience and also for me. And I have a question about the perception. So we know that, as you said, perception right now is focusing on object detection that provides a bounding box for each type of object. And we know that there will be two types of errors-- false positives and false negatives-- right? Right now your product is mostly focused on the safety feature, where it is OK to have some false negatives-- the missed detections-- because people may not notice that. But when it comes to the self-driving car, we have to have a system with absolutely no missed detections, because any missed detection could be a fatal crash. And we know that Tesla had two deadly crashes, one with a big truck-- a white truck. And now there is one in China-- not many people know that-- crashing into a special trash truck. And to my understanding it's really hard to prevent accidents with these kinds of objects-- for example, cars with special paint, or people carrying a tree-- or a Christmas tree, something like that. So I think we may need a more general object detection rather than this type of supervised training on specific types of objects. I would like to ask your idea about that.
AMNON SHASHUA: OK, so it's a great question. Every time Tesla is mentioned, I get upset.
[LAUGHTER]
AMNON SHASHUA: And I will explain why.
AUDIENCE: Sorry.
AMNON SHASHUA: But it's a great question. So driving assist today is technology to prevent accidents. Now the driver is responsible. The driver is holding the steering wheel. The driver is driving; the driver is responsible. So what you really want to optimize in driving assist is to have zero false positives. You are willing to have a certain small level of false negatives. But you want to have zero false positives, because imagine you are a layman driving a car and all of a sudden the car does an emergency braking for nothing. There was a shadow on the road and your car stopped. You'll simply take that car and return it to the dealer. You're not going to drive this car again. So you really need to reach a zero level of false positives. Then maybe it will happen once in the lifetime of the car, or something like that.
When you're talking about autonomous driving you have to have zero false positives and zero false negatives. The way you reach zero false negatives is that you have multiple sensors-- you have multiple modalities. Whereas in driving assist you mostly have one sensor-- the predominant sensor is the camera. In some cases, in premium cars, you have a camera and a radar. In autonomous driving we're talking about every area of the field of view being covered by at least two sensing modalities.
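Under the simplifying, and in practice only approximate, assumption that the two modalities fail independently, the benefit of covering every region with at least two sensors is easy to quantify: if the camera misses an object with probability $p_1$ and the radar misses it with probability $p_2$, a fusion that needs only one of them to fire misses with probability

$$P(\text{both miss}) = p_1\,p_2, \qquad \text{e.g. } p_1 = p_2 = 10^{-3} \;\Rightarrow\; p_1 p_2 = 10^{-6}.$$

The weak point is the independence assumption: correlated failure modes are exactly what this kind of estimate can hide, which is one argument for combining physically different modalities.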
Now the crash-- the Tesla crash had nothing to do with this false negative, false positive. There was an NHTSA crash report. And the NHTSA crash report said what we said earlier, that the crash happened outside the design of the system. The system was designed for rear end crashes. This is driving assist. The accident was a T-bone collision. The sensors of the car-- especially the camera-- were not designed for T-bone detection. They were designed for rear end crashes. Now that does not mean that we cannot do T-bone detection. But in the system of Tesla there was no T-bone detection. Tesla came out with stories about a white truck and the sun and so forth, and so forth. This made us very upset.
Then there was the NHTSA crash report which said this was outside the design parameters of the system. We're not talking about limitations of sensing. Sensing-- the way we do sensing-- use deep learning, use data driven techniques, use multiple sensors, and you can reach 0% false positives and 0% false negatives. And this is the easy problem among all the problems that I mentioned. People tend to focus on it because you can tell stories: I'm holding an umbrella, I'm waving my hand. All of these are easy problems. None of them are difficult problems. The difficult problems are the other problems that I mentioned, which people don't talk about.
AUDIENCE: Yeah, sure. Now can I ask another question about the hard problem? So you talked about the driving policy. And you showed that very cool demo where the car can negotiate in that double merging case. But I wonder how it can transfer to the real scenario. Because the challenge for the reinforcement learning is that what you train is how to react to yourself-- the other cars have the same policy. But a human may have a different kind of model, a different kind of mental state. So they may have a different policy, different decisions.
AMNON SHASHUA: It is a great question, because I didn't touch on it. What I talked about is the robotic car driving policy-- having a model which will guarantee safety, and how one would use machine learning such that we can guarantee safety on one hand and drive like humans on the other.
But now comes a new question: how do I validate this? So let's assume that I'm a regulator and there is an operator that wants to put out self-driving cars. And I want to measure the probability of an accident. How do I do that? I cannot do it on a test track. What, I'm going to drive on a test track and say everything is OK? A test track doesn't reflect the complexity of the real world. Am I going to drive around my block one million miles and say, I drove one million miles and everything is OK? This is what people do, by the way.
How do you go and validate this? So one way to validate is more or less what you are saying. Let's try to build a generic model of how humans drive. We're talking about human driving policy, not robotic driving policy. Let's do something similar to what happens in the area of pictures. We have these GANs-- generative adversarial nets-- that create realistic pictures. Why not try to create realistic trajectories: collect a lot of data and create realistic trajectories of how a human drives. Using the high definition maps, create a computer game where we have agents driving on realistic roads. And the trajectories of their driving paths are mimicking human drivers, including reckless human drivers, and so forth. And then we take our vehicle with our robotic driving policy and we drive in the simulator-- we drive millions of times, an infinite number of times. And we need to prove that we don't have accidents. And this is still an open problem.
And this would be the way, I believe, in which these technologies will be validated. Otherwise how do you validate this? Wait years until you show that you're testing a fleet of 1,000 vehicles, with all these measures of how much time I hold the steering wheel? It's all very, very misleading. Because I can drive in simple areas for one million miles and show that I don't touch the steering wheel. And I will avoid going into complicated areas because I don't want to mess up my statistics. Right?
AUDIENCE: That's right.
AMNON SHASHUA: So this is an open problem.
AUDIENCE: But I think--
AMNON SHASHUA: But I think others want to ask questions.
[LAUGHTER]
AUDIENCE: Sorry. Thank you very much.
AMNON SHASHUA: Yep, this side.
AUDIENCE: Hi. Thanks for the talk. I wanted to know what differentiates the big players in the autonomous driving industry. For example, how does Mobileye differ in its engineering and technology from other companies? Is there generally a metric for measuring which companies are doing better than others right now?
AMNON SHASHUA: Well, autonomous driving is not out there yet. So all of these engineering efforts are basically science projects. It's very, very difficult to tell what the performance of others is-- all you are exposed to are some test vehicles that reporters are driving, so you don't really know what is out there. But you do know that there are different approaches, and those are the two approaches that I mentioned-- the map-heavy and the map-light approach. I believe that we are the only ones with the map-light approach. And this approach is very attractive to the car industry because it really fits the way they see the world in terms of cost and scalability. It is the way that they can leverage their size: if you have millions of cars, these millions of cars can generate the maps. Using crowd-sourcing you reduce the cost of the map. Everything fits the way the car industry looks at things. In terms of performance, there is no production-worthy vehicle out there. It's all testing vehicles. So it's difficult to say who is the king of the hill right now. There is no king of the hill. It's all science projects. And it will take a number of years until we start seeing these vehicles in production.
I'm opening a can of worms with this answer. The way this is going to unfold is not that we wait until everything is perfect in 2021, when we would have mobility on demand with vehicles that can drive without a driver. There are going to be steps on the way, in which you are going to have something like a Tesla Autopilot-- but better, and better, and better. The driver still needs to be alert, but you can get performance which is going to be very, very similar to the performance of true autonomous driving. And you have to be alert in order for suppliers to be able to fine-tune the technology. Because if you wait until the technology is perfect, it's not going to happen. You need to test. And you need to test not only with a fleet of 10 vehicles driving around the block; you need to be able to test using tens of thousands of vehicles. And that you can do when you have vehicles in production.
So there are going to be a number of programs, starting from 2019, in which you would get a kind of autopilot-- better than what it is today-- from many car manufacturers. In 2021 you are also going to see vehicles that are limited to highway driving only, but safe highway driving. This is called level three. So one can have tens of thousands of these vehicles also generating data for the kind of driving policy that I mentioned before-- data for learning human driving policy. It's not that in 2021 all of a sudden we're going to have mobility on demand. It's going to be a phased approach.
AUDIENCE: Thank you.
AUDIENCE: Thank you for your talk. I think we could agree that your camera is a safety-critical piece of equipment in your system, correct? And you used the analogy of the map as being the redundancy for the camera. Do you class your map data as a safety-critical piece of equipment, and treat it as such? Will it have that kind of requirement?
AMNON SHASHUA: Well, the map data is safety critical. But it's not a single point of failure, because you also have the sensing. As I said, the perception-- the sensing-- should be sufficiently advanced and sophisticated to understand the drivable paths even without the map, and then the map becomes a redundancy to the sensing. But the map should be accurate all the time. The way the map stays accurate all the time is that you build it through crowd-sourcing: you have millions of cars always generating data, so once something changes in the environment it is almost immediately changed in the map and then transmitted to the cars. In terms of communication you have an uplink and a downlink. The uplink is the simpler problem, because we're talking about 10 kilobytes per kilometer. So if you're driving 100 kilometers, it's one megabyte-- my smartphone sends much more than that. Then there's the downlink. On the downlink you are going to send, say, map data covering 100 square kilometers, and that's not going to be one megabyte. We're talking about 5G communications that will be coming out in the next few years. So when we're talking about autonomous driving, on the downlink you need to think also in terms of 5G networks. And then you can have continuous updates of these maps wherever you are driving. And because you have millions of cars doing crowd-sourcing, the map should always be correct.
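The uplink arithmetic quoted above works out as follows; this is just a check of the 10 KB/km figure from the talk, not additional data.

```python
# Uplink: roughly 10 kilobytes of map-update data per kilometer driven.
kb_per_km = 10
km_driven = 100
uplink_kb = kb_per_km * km_driven
print(f"{km_driven} km -> {uplink_kb} KB (about {uplink_kb / 1000:.0f} MB)")  # about 1 MB
```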
AUDIENCE: Right. I'm just trying to tease out the distinction between a map which has to be very good versus a map that has to be life-dependent.
AMNON SHASHUA: The map should be very good.
AUDIENCE: But not life-dependent?
AMNON SHASHUA: It has to be very good and live. It has to be updated all the time.
AUDIENCE: OK.
AUDIENCE: As a Mobileye shareholder and a Tesla shareholder, I never understood what really happened in the divorce. Is there anything you could share with this intimate group?
[INTERPOSING VOICES]
[LAUGHTER]
AMNON SHASHUA: No, I cannot share anything. We had an ugly divorce, and at some point we said we're not going to comment anymore. And then Tesla stopped commenting as well. And now we're all happy. So I'm not going to comment any more about Tesla.
[LAUGHTER]
AUDIENCE: So I have a question about the front-facing camera. If all the sensing technology is very dependent on this camera, what would be the desired specifications for this camera-- for example, the frame rate, exposure, that kind of stuff? Because I can imagine that when you're driving, the roadside will be passing by very fast, so if you use a low frame rate you won't be able to capture as much.
AMNON SHASHUA: So automotive cameras are slightly different from consumer cameras. In automotive cameras you need very, very good low-light performance. The pixel sizes are much larger than the pixel sizes of consumer cameras. For example, I think the pixel size in an iPhone is 1.6 microns; the pixel size in automotive cameras is 4.5 microns, so they gather much more light. This is why automotive cameras have low resolution. The most advanced cameras in production have 1.3 megapixels-- we're talking about an order of magnitude less than consumer cameras. But there are also tricks coming out in 2019 for how to get high-resolution cameras, and this is using binning. Analog binning means that you look at super pixels-- say two-by-two pixels, or three-by-three pixels-- and treat them as a single pixel in terms of light collection. And with analog binning you can do this at frame rate. So you are basically trading resolution for light sensitivity: when you have enough light you can have the full resolution, and when you need more light you reduce your resolution by using these super pixels. And you can do it at frame rate. So cameras starting in 2019 will have an eight-megapixel resolution, which is starting to be interesting from the point of view of consumer cameras, with this analog binning capability. But we're not talking only about front-facing cameras. It's cameras 360 degrees-- about seven to eight cameras around the car.
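A minimal sketch of what two-by-two binning does to resolution versus collected light. The averaging is done digitally here for clarity; on a real sensor the charge is combined in the analog domain before readout, and the numbers below are illustrative only.

```python
import numpy as np

def bin_2x2(frame):
    """Combine each 2x2 block of pixels into one super pixel.
    Summing the four wells collects ~4x the light at 1/4 the resolution."""
    h, w = frame.shape
    trimmed = frame[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

# Simulated dim scene: low photon counts per pixel.
full_res = np.random.poisson(lam=2.0, size=(8, 8)).astype(float)
binned = bin_2x2(full_res)
print(full_res.shape, "->", binned.shape)          # (8, 8) -> (4, 4)
print("signal per pixel:", full_res.mean(), "vs", binned.mean())  # ~4x more signal
```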
AUDIENCE: I see. Sorry, I just have a very quick follow-up. If you have a fixed frame rate-- for example, at night there might be a situation where the car coming toward you has very bright lights-- that could actually saturate the camera. So do you have some sort of feedback--
AMNON SHASHUA: Yes, so there's a lot of sophistication going on in camera processing-- not the computer vision, the camera processing-- in how to use multiple exposures. We can change the gain of the camera and the gain curve at frame rate. So we run the camera at about 60 frames per second even though we want to work at 30 frames per second-- in some cases 10 frames per second. The frame rate depends on the task: not everything is done at 30 frames per second, sometimes it's 10 frames per second, but the camera runs at 60 frames per second. And we're using multiple exposures in order to get this high dynamic range. There's a lot of sophistication in the camera control.
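A hedged sketch of the multi-exposure idea: capture alternating short and long exposures at the sensor's native rate and merge each pair into one high-dynamic-range frame. The weighting scheme and the gain factor below are simplified assumptions, not the actual camera-control pipeline.

```python
import numpy as np

def merge_exposures(short_exp, long_exp, long_gain=4.0, saturation=255.0):
    """Merge a short and a long exposure of the same scene into one HDR frame.
    Where the long exposure saturates (e.g. oncoming headlights), fall back to
    the short exposure, after bringing both to the same radiometric units."""
    saturated = long_exp >= saturation
    hdr = long_exp / long_gain            # scale long exposure to short-exposure units
    hdr[saturated] = short_exp[saturated] # recover detail in blown-out regions
    return hdr

# Two consecutive 60 fps captures merged into one 30 fps HDR frame.
short_frame = np.random.rand(4, 4) * 255
long_frame = np.clip(short_frame * 4.0, 0, 255)  # same scene, 4x exposure, clipping
print(merge_exposures(short_frame, long_frame))
```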
AUDIENCE: Thank you.
AMNON SHASHUA: OK.
AUDIENCE: Hi. I have one medium-length question. You mentioned that we could have self-driving cars by 2021 if you have the technological revolution. Do you think there could be another technological revolution after that, to maybe enable superhuman driving, where self-driving agents don't really drive like humans but drive better than humans? Maybe by controlling more surfaces on the car, or by making use of the physics, so that they can drive at 200 miles per hour on a highway.
AMNON SHASHUA: It is a great question, because when I talked about maps nobody asked me why we need maps to begin with. Right? Because we humans don't need a map. We need a navigation map to know where to drive, but we don't need a map in order to drive, right? And if we want to mimic humans, then let's go for it-- why do we need a map? The reason for the map is that we know that human intelligence, in terms of sensing and all-around driving, is so high that we would really need a scientific revolution to reach that level. Forget about all the hype that people talk about with AI. We would really need a scientific revolution to reach anything close to human perception. The map is a way to lower the bar. This is why we need a map. We're giving the system something that humans do not have.
AUDIENCE: To make life easier basically?
AMNON SHASHUA: It will make life possible.
[LAUGHTER]
AMNON SHASHUA: Without it-- we're here at MIT, we're talking truth. Forget about all this hype. We're very, very far away from anything that is even close to human capabilities. Very far away. The map is a way to bridge the gap. And it's a huge gap. This is why the map is so, so critical.
AUDIENCE: OK. Thank you.
AMNON SHASHUA: And let me know when to stop. OK?
[INAUDIBLE]
AUDIENCE: I'm the last? OK, who cares.
[LAUGHTER]
AUDIENCE: So how do you actually-- you have some desires and you have some policy constraints you need to satisfy, and you use reinforcement learning so that it comes up with a policy that satisfies those constraints. How do you actually do it?
AMNON SHASHUA: Well, read the paper that we wrote. But in a few words, it is very close to AlphaGo-- the Go-playing reinforcement learning by DeepMind. There you also have a tree of possibilities that you need to traverse. You learn how to traverse this tree using imitation learning, and then on top of that you use reinforcement learning to find the most likely path along a deeper tree. Something very similar is happening in what we are doing, with the addition that we have a mathematical model of safety-- this is something that was not done before-- to guarantee that we will not have accidents. And in that way we remove all those rare events for which we would otherwise need to collect a lot of data. Because the model guarantees that there is not going to be an accident, we can focus only on the desires.
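A minimal sketch of the structure described above: a learned score ranks candidate maneuvers (the "desires"), while a hand-specified safety model vetoes any candidate it cannot certify, so the learner never has to observe the rare crash events. Everything here-- the safety rule, the score, the maneuver fields-- is an illustrative placeholder, not the formulation in the paper.

```python
def safe(maneuver, world_state):
    """Placeholder for the formal safety model: returns True only if the
    maneuver can be certified not to cause an accident in this state."""
    return maneuver["min_gap_m"] >= 2.0 * world_state["closing_speed_mps"]

def desire_score(maneuver, world_state):
    """Placeholder for the learned (imitation/RL) scoring of comfort and progress."""
    return maneuver["progress"] - 0.1 * maneuver["jerk"]

def choose_maneuver(candidates, world_state):
    # Hard constraint first: discard anything the safety model cannot certify.
    admissible = [m for m in candidates if safe(m, world_state)]
    if not admissible:
        return {"name": "brake", "progress": 0.0, "jerk": 1.0, "min_gap_m": 0.0}
    # Among the provably safe maneuvers, pick the one the learned policy prefers.
    return max(admissible, key=lambda m: desire_score(m, world_state))

state = {"closing_speed_mps": 3.0}
options = [
    {"name": "merge_now", "progress": 1.0, "jerk": 2.0, "min_gap_m": 4.0},
    {"name": "yield",     "progress": 0.3, "jerk": 0.5, "min_gap_m": 8.0},
]
print(choose_maneuver(options, state)["name"])  # "yield": merge_now is not certified safe
```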
AUDIENCE: So that's orthogonal to the learning? That's kind of pruning the tree afterward as a constraint?
AMNON SHASHUA: It makes the learning possible, because otherwise we would need to collect a lot of data in order to find those rare events, which are the accidents. And we would have to collect this data again, and again, and again.
AUDIENCE: Cool, thanks.
AUDIENCE: Hi. So you're talking about millions of users uploading map information, and you combine that information in your cloud-- on your server side-- to create a very precise map. But what about those places that few people go?
AMNON SHASHUA: By 2020 every new car in Europe and the U.S.-- and I believe also in Japan and in China-- will have a front-facing camera. You can see that the number of chips Mobileye is selling has been almost doubling every year since 2012; in 2016 we sold about 6 million chips, for 6 million cars. By 2020 every new car is coming out with a front-facing camera, and this front-facing camera will have the ability to send data to the cloud. It is reasonable to say that tens of millions of cars will be sending data. So there's not going to be a place where no car is passing. And if no car is passing, then you don't need to have autonomous driving there, because there is a reason why no car is passing there.
[LAUGHTER]
AUDIENCE: But how many users do you need for a specific position or place?
AMNON SHASHUA: It's a good question. Right now we're doing it with five drives-- so five vehicles would need to drive the road. We believe we can get it down to three. But because it's a crowd-sourcing thing, we're not that concerned about whether it's five or three.
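A rough sketch of why a handful of drives can be enough: averaging independent, noisy observations of the same landmark shrinks the positional error roughly as 1/sqrt(N). The landmark position and per-drive noise level below are invented for illustration, not Mobileye figures.

```python
import numpy as np

rng = np.random.default_rng(0)
true_position = np.array([12.0, 3.5])  # hypothetical landmark position (meters)
per_drive_noise_m = 0.5                # assumed per-drive localization noise

for n_drives in (1, 3, 5, 10):
    # Each drive contributes one noisy observation of the landmark.
    observations = true_position + rng.normal(0.0, per_drive_noise_m, size=(n_drives, 2))
    estimate = observations.mean(axis=0)
    error = np.linalg.norm(estimate - true_position)
    print(f"{n_drives:2d} drives -> error ~ {error:.2f} m")
```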
AUDIENCE: Thank you.