Computer Vision that is changing our lives (1:13:30)
March 23, 2015
Brains, Minds and Machines Seminar Series
Prof. Amnon Shashua, Hebrew University, Co-founder, Chairman & CTO, Mobileye (NYSE:MBLY), OrCam.
Amnon Shashua holds the Sachs chair in computer science at the Hebrew University. He received his Ph.D. degree in 1993 from the AI lab at MIT, working on computational vision, where he pioneered work on multiple view geometry and the recognition of objects under variable lighting. His work on multiple view geometry received best paper awards at ECCV 2000, the Marr Prize at ICCV 2001, and the Landau award in exact sciences in 2005. His work on graphical models received a best paper award at UAI 2008. Prof. Shashua was the head of the School of Engineering and Computer Science at the Hebrew University of Jerusalem during the term 2003–2005. He is also well known for founding startup companies in computer vision, and his latest brainchild, Mobileye, today employs 250 people developing systems-on-chip and computer vision algorithms for detecting pedestrians, vehicles, and traffic signs for driving assistance systems. For his industrial contributions, Prof. Shashua received the 2004 Kaye Innovation Award from the Hebrew University.
Advances in computer vision are revolutionizing two technologies that can profoundly impact people’s lives: driving assistance systems that perform tasks such as emergency braking to avoid collisions, and wearable vision systems that can perform everyday tasks that enhance the lives of the visually impaired. Amnon Shashua illustrates the capabilities of Mobileye’s driver assistance technology, which combines visual object detection, motion analysis, road analysis, and the creation of environmental models, and is deployed on many cars manufactured today. Drawing on the experiences of visually impaired users with the OrCam vision technology, he shows how the ability to perform tasks such as reading text in newspapers, menus, and signs, and recognizing faces and objects in a scene, can improve the lives of these users.
TOMASO POGGIO: I'm Tomaso Poggio, the Director of the Center for Brains, Minds, and Machines. And I have the great pleasure to introduce Amnon Shashua today.
There are very few scientists worldwide who are capable of both making theoretical, scientific breakthroughs, and transforming them into technological breakthroughs of importance for industry and for society. And Amnon is one of these very exceptional people.
So I first met Amnon when he arrived at MIT for graduate studies. And this was 1988, something like that. And then he joined my group as a postdoc in '92, and then we interacted on a number of technical issues in computer vision and computer graphics. Also, I think I started his interest in entrepreneurship, which he followed very successfully, as you will hear today. He was, of course, one of the best postdocs I ever had, and one of my greatest friends.
He did his Master's thesis with Shimon Ullman, on saliency computation. And later, he worked on recognition for his PhD thesis, obtaining simple but powerful results on invariance to illumination that have since been rediscovered multiple times. In '94, around the end of his postdoctoral stage here at MIT, he wrote a paper on multiple view geometry, which introduced a fundamental algebraic relationship between three views, related to the trifocal tensor, which has found many theoretical and practical applications, ranging from 3D reconstruction and camera calibration to robotic navigation and computer graphics animation.
In '95, he founded Cognitens, a company that used the trifocal tensor mathematics, combining optics, mechanics, and the physics of illumination, in a project for extremely accurate industrial 3D measurements. In 2000, he founded Mobileye, and in 2010, OrCam. They are both industrial partners of the Center for Brains, Minds, and Machines. And they are, in fact, the ones I always pick, when I speak about the Center, as examples of our industrial partners and their leadership in the technology of intelligence.
I am always and continuously amazed by the depth and mathematical sophistication of Amnon's work, and by its enormous impact on technological applications. As you will hear in his talk, Mobileye is an amazing case, one that I think is the prototypical success story in computer vision and in machine learning up to today, precisely because the extent and the success of the underlying mixture of sophisticated theory and impressive technological applications is absolutely unique.
I'm very proud to introduce to you Amnon Shashua. Amnon.
AMNON SHASHUA: Well, Tommy, what an introduction. I have very fond memories of MIT; those were among the best five years of my life. But I forgot how cold it can be here in Boston. Today I came in with a ski mask and a ski coat, and I was still cold to the bone.
I'll talk about computer vision. And I think that the title here is very bold, computer vision that changes our world. We all have cameras with us. All smartphones have cameras. They do a bit of computer vision. When we take a picture, they'll do face detection, and they can also do panoramic stitching, which was one of the major algorithms in computer vision a decade ago. But you cannot claim that what the cameras on smartphones do today will change our world.
So why is this statement so bold? Because I'm going to talk about something else related to cameras. We are at a time in which cameras are either on us or nearby us, and those cameras are doing computer vision continuously, not on demand, not only when we take a picture, but continuously. And those two areas are cameras in cars, so the cameras are near us when we drive, and wearable cameras, cameras on us. And those are the stories of Mobileye, where the cameras are in the car, and OrCam, where the cameras are on us.
So I'll start with Mobileye. What we do there is a camera, front facing, looking at the scene, understanding the visual world, understanding where vehicles are, where pedestrians are, where traffic signs are, where lanes are, traffic lights, trying to get a detailed understanding of the visual field, and use that in order to prevent accidents. So let me show you. This is a clip. It's part of the Super Bowl commercial by Hyundai. And first I'll show you the clip, and then I'll explain why I'm showing this clip.
-Remember when only Dad could save the day?
-Auto emergency braking, on the all new Genesis, from Hyundai.
[END VIDEO PLAYBACK]
AMNON SHASHUA: OK, so what you saw here is the system. This is Mobileye's system: a camera on the windscreen, detecting that there is an imminent collision, warning before the collision, and applying the brakes before the collision. Now, why is this commercial important? Because a Super Bowl commercial is very, very expensive. Every second of air time costs a lot of money.
Now, Hyundai here is showcasing their new vehicle, the Genesis. When you are showcasing a new vehicle, there is lots of stuff you can talk about. You can talk about the engine, you can talk about the multimedia. There are many things to talk about. And they chose to talk about active safety, which means that active safety in vehicles has passed a threshold at which there is awareness by the public, and by the regulators, such that if you have such a function, you are proud of having it, and you are even willing to spend commercial money at a Super Bowl to talk about it. This is why it is important.
When you have a camera facing forward, trying to understand the visual field, there are many applications being done. Part of them are safety features, like detecting lanes, detecting people, detecting cars, calculating the chance of a collision, and taking actuation, whether it's warning or braking before a collision, recognizing traffic signs, controlling the headlights at night. And then there are also the functions that are leading us to automated driving. So I'll try to give more details about what's going on here, through clips and simple explanations, so that we can get a sense of where computer vision is going when we're talking about cars and automated driving.
The clip I'm showing here: Volvo, in 2010, introduced the first pedestrian detection system. The camera faces forward, detecting pedestrians, and if a collision with a pedestrian is about to happen, the car brakes. Now, they had about 5,000 journalistic events showcasing the car. They put people in the driver's seat, drove toward a mannequin, and at the last moment the car would stop.
But once the car is out there, you can buy it. People do their own testing. So this is a clip I downloaded from the internet. This is a bunch of Polish guys. It's a bit funny, but it actually shows you what the system does.
[END VIDEO PLAYBACK]
AMNON SHASHUA: And this works up to 70 kilometers per hour, so it's not only at slow speeds.
AUDIENCE: They are Slovak, not Polish.
AMNON SHASHUA: Slovak? OK, I'll change it.
These kinds of functions are proven to save lives. This is a report by the insurance institute in the US. And you see here that there are statistically significant numbers, a minus 40% chance of bodily injury liability, all from a system that only warns you about a collision. Imagine a system that brakes before a collision.
Now, because it is low cost-- low cost because it's only a camera, and a camera costs about $5; you then need a microprocessor to process the information, but every sensor needs a microprocessor to process information-- we're talking about a very, very low cost sensor with a proven ability to save lives. Now the regulators are involved. Government regulation influences the car industry to introduce these systems-- they are called active safety-- as a standard fit, just like airbags, just like stability control.
And what this brings us to is trends in this industry. One is an evolutionary trend, where these safety regulations influence the car industry to put these systems in as a standard fit, standard fit meaning the driver doesn't pay for the system; you get it just like you do with airbags. So by 2017, '18, every new car on the road in developed countries will have, as a standard fit, an active safety system. So we're talking about tens of millions of cars, every year, with these kinds of capabilities.
Then the second trend-- we call it a revolutionary trend, which has the potential of transforming the way we drive-- is autonomous driving capability, where you can let go of the steering, let go of the throttle and brakes, and the car will drive on its own, at first in limited situations, like highway driving, then going into rural and city traffic, and eventually taking the driver out of the driving experience completely.
So this is an example, the Nissan Qashqai. In 2014 it has a five star rating, a Euro NCAP five star rating. Now, in order to get the five stars, as you can see here on the next page, the car needs to undergo certain tests, and these are collision avoidance tests. This started already in 2011, but the big change in 2014 is that if you want to get your four and five stars, you have to have these kinds of systems. So again, by 2017, '18, every car will have these kinds of technologies.
The same thing is being planned for government regulation in the US. It's called CIB, crash imminent braking. By 2016, all new cars that want their five stars will have to comply with these functions.
We also see the effect of regulatory involvement in the sales of Mobileye. Mobileye produces a chip. It is a system on chip, with algorithms to do this visual interpretation, to do the computer vision. We started launching in 2007. In the five years between 2007 and 2012, we shipped one million chips, one million cars with this kind of technology. 2013 alone was 1.3 million, and in 2014 it more than doubled. So you see this kind of exponential rise.
You can also see this in the number of car models. In 2010, there were 36 car models with this technology, across seven car manufacturers. In 2016, we're talking about 240 models across 23 car manufacturers. So there's this exponential growth, because, one, it's a technology that saves lives, and second, it's very low cost. You have to have both of them in order to have an impact. If you have something expensive, even if it saves lives, it cannot enter the market. If you have something low cost that doesn't do anything useful, it also doesn't enter. You have to have both.
And the next stage is automated driving. I'll show you a clip that we prepared for our road show, half a year ago. It shows me driving and talking to-- me not driving, and talking to the camera. What I am saying is not important. I just want you to see what I-- so I'll--
AMNON SHASHUA: OK, so you don't need to hear me. The point here is that you can see that I'm not even looking at the road for extended periods of time. This is my personal car, by the way. It's completely autonomous-- I can drive, say, from Tel Aviv to Jerusalem and not touch the steering wheel even once. It is very, very robust, and this kind of technology is going to be launched as early as three months from now. Tesla already announced that three months from now, you can have hands-free driving. It's going to be a bit more limited than what my car can do, but that's only the start.
By 2016, there are going to be new launches by GM, and Tesla as well. In 2017 and 2018, additional launches. Volvo is one of them. Three more car manufacturers, whose names I cannot mention, also in 2017, '18. So there's a driving force behind it.
[END VIDEO PLAYBACK]
So let's start looking under the hood. I've made the case for why it is interesting. Now let's see what's happening here under the hood.
When we look in terms of computer vision, these are the main issues that need to be resolved. One is object detection. So object detection means detect vehicles, front, rear, side of the vehicle.
Detect pedestrians, also in various poses. We need to know whether a pedestrian is looking at us or we see the back of the pedestrian. Is the pedestrian on the road? Is the pedestrian on the sidewalk?
Detect animals. Volvo is, I think in the next two weeks, launching a system that has also animal detection. So a horse, a moose, big animals. There are many accidents with animals, by the way.
Traffic signs-- we're talking about the vocabulary of about 1,000 traffic signs that the system can recognize. Traffic lights, and the stop line of traffic lights, road markings. So all of this is object detection.
Then you have visual motion analysis: vehicle motion, understanding the rotation and translation of our movement, of the car's movement. Structure from motion, creating 3D from optic flow.
Collision assessment-- if we see something moving, and we are moving, are we going to collide or not going to collide? A lot of visual motion analysis behind this.
General object detection. Are we going to hit something? We don't know what we are going to hit, but we are going to hit something. Could be a trash can, or-- it's called general objects. This also uses a lot of visual motion analysis.
Then we have road analysis. Detect lanes, if we can find them. Road geometry, to build planar surfaces on the road. Path planning, to know, if you now want to drive autonomously, where the car should be in the next seconds or so. If there are lanes, it's simple, but if there are no lanes, one needs to do something else there.
Road profile, detect bumps and potholes. This is an important thing. It's both a comfort function, and also a safety function when you do autonomous driving. As a comfort function, if you have the ability to control your shock absorbers, you can pass smoothly over bumps, which defeats the purpose of having the bumps to begin with. This is going to be introduced in 2016, by one of our customers.
But if you have the ability to detect bumps, you also have the ability to detect debris, hazards. Let's assume somebody threw a tire on the road. So it's a 10 centimeter object, say 50 meters away, you want to be able to detect it, in order to at least warn the driver that you are going to hit, you are going to go over debris. So that's road analysis.
Environmental modeling-- ultimately, it's giving me, for every pixel in the image, a label of what it is. Is it coming from a barrier, coming from a curb, coming from a guardrail? Is it coming from a vehicle, the side of a vehicle, the front of a vehicle? Is it coming from a pedestrian? Is it coming from a pole? Anything. Every pixel in the scene tells me what it is, in order to create a complete picture of the environment-- say, 180 degrees-- in order to path plan the car in an autonomous driving situation. This is really science fiction, when you think about what computer vision can do.
So I'll show you examples of all of these. This clip tries to put many of these things together. So I'll run this clip, and maybe stop after a few frames.
OK, I'll stop it here.
So what do we see here? So we see bounding boxes on pedestrians, cyclists and pedestrians. We see bounding boxes on vehicles. We see this green carpet, which tells me where the free space is. This is a traffic sign. This is a traffic light.
And of course, the pedestrians can be detected also when they are occluded, and all sorts of things. Let me stop this again. You can see here again, this green carpet. It goes around all the places which are road-- not curb, not barriers, not vehicles, or pedestrians-- in order to allow the system to path plan.
[END VIDEO PLAYBACK]
So let me show you some-- this is an example of animal detection.
This is going to be on Volvos. So you have a big animal being detected, and if there's going to be a collision, the collision will be avoided.
[END VIDEO PLAYBACK]
Traffic lights. As you see here, we have a complex scene, like city traffic. Whenever there is a traffic light, it is detected, together with the color of the traffic light. The system detects the traffic lights, detects the relevancy of the traffic lights, which traffic light is for straight, for left, for right, and also detects the stop line, to know where to stop. And it's going to be launched in a few months, by one of the car manufacturers.
[END VIDEO PLAYBACK]
Pavement markings-- what you see here, detection of pavement markings, arrows and so forth. This is important for autonomous driving. I'll skip this.
Pedestrian detection, but with understanding of the pose. For example, what's written here is "on road", these pedestrians are on the road, and "on pavement", we know that these pedestrians are on the pavement. Later, we'll see pedestrians with their pose; for example, it's written here whether it's the back or the front of the pedestrian.
So, a better understanding of what pedestrians are doing: not only that there is a pedestrian, this is the range of the pedestrian, this is the angle of the pedestrian, but also what the pedestrian is doing, the activity.
This is the detection of bumps. So what we are going to see here is that there is a 4 centimeter bump here.
And you see here, this is the profile of the vision system. So the vision system detected that there is a bump, and gives also the height of the bump. Now if you have automatic, electronic shock absorbers, you can pass over the bump without even feeling it, but this kind of technology, you can use it also for debris detection.
[END VIDEO PLAYBACK]
And in the next clip, what you see here is that there's going to be a 10 centimeter high object. These blue points are magnified here. This is a zoom of this area. And the blue points will turn red when the system detects an object which is 10 centimeters high or above.
Let me run this slowly, so you can see how these points turned into red. The next one is just a plastic bag on the road, so there's no points turning into red. And this one here is another real object, and you see how the points--
[END VIDEO PLAYBACK]
So this is the ability to detect debris, and this is using visual motion understanding, plane plus parallax. Whoever here has an education in computer vision, this is plane plus parallax.
Next thing is path planning. With path planning, we want to be able to tell where the path is that the car needs to take. Imagine hands-free driving: the car needs to decide the path. Now, in many cases you have lanes. If you find lanes, then the problem is simple. You simply do a polynomial approximation to the shape of the road, and you can follow this polynomial approximation. But imagine that there are no lanes.
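As a concrete illustration of the simple, lanes-visible case just described, here is a minimal sketch of fitting a polynomial to detected lane-marking points and querying the path ahead. This is not Mobileye's implementation; the sample points and the cubic degree are illustrative assumptions.

```python
import numpy as np

# Hypothetical lane-marking points in the road plane (values are made up):
# z = distance ahead in meters, x = lateral offset in meters.
z = np.array([5.0, 10.0, 20.0, 35.0, 50.0, 70.0])
x = np.array([0.1, 0.15, 0.3, 0.7, 1.3, 2.2])

# Fit a cubic, a common low-order approximation to lane/road curvature.
coeffs = np.polyfit(z, x, deg=3)
path = np.poly1d(coeffs)

# Query the lateral position of the path at a few look-ahead distances.
for z_ahead in (10, 30, 60):
    print(f"at {z_ahead:3d} m ahead, lateral offset ~ {path(z_ahead):.2f} m")
```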
So I'll show you several examples where, when you look at the clip, you don't see lanes. Yet a human driver can easily determine the path. And what we are doing here is using holistic information. We're using context, so it's not only a bottom-up process that tries to find lanes, because that process will fail. There are no lanes. It uses all the information that there is in the image. For example, there could be guardrails, barriers, other cues that people extract from the image when they look at such an image.
And we use a deep network, in order to go from input to output. We have lots of deep networks in the system. I'll show two of them. This is one of them. So let's have a look at this.
So you see here that even though there are no lanes, this green line is the path ahead. The system correctly determines the path forward, even though there are no lanes in the image.
[END VIDEO PLAYBACK]
Let's look at this one here. This is city traffic. You would like to support hands-free driving also in city traffic.
But there are no lanes in city traffic, if you look at this clip. Simply no lanes, yet you want to be able to understand that there are two lanes here, and this is the path to follow, even in city traffic. Because there's lots of holistic information that allows you-- for example, the curbs here-- that allows you to figure out what's the path to drive, even though there are no lanes.
[END VIDEO PLAYBACK]
Next one is being able to determine the free space. So free space where I'm allowed to drive. And on the edges of this free space, tell me what it is. Is it an edge with a side of a car? Is this an edge with a curb? Is this an edge with a guardrail? Is this an edge with a pole? And so forth, in order to build a complete environmental model.
So what I'm showing here is that this green carpet is the free space. And then, on the edge of the free space, there are three color codes. One is a car, which is blue. Red is a physical edge, which could be a curb or a guardrail. And purple is the side of a car.
Let me run this, so you can have an appreciation of what this does. And this is based on single image understanding. There's no need for motion analysis here.
OK, so this is a deep network that combines a convolutional net with graphical models, all put together, running at 30 frames per second on a single chip, and taking only 5% of the chip capacity. So this is very, very efficient.
[END VIDEO PLAYBACK]
Here is another example, just showing the green carpet without the [INAUDIBLE] on the edges. So, for example, it knows that this is a barrier. And it knows that this is also a barrier, because it sees that it stops at the curb. And also, this car is a barrier, and also the information here. All of this is learned through an input/output learning machine, using a convolutional net in this case.
[END VIDEO PLAYBACK]
Here's another example. So you see, there's the green carpet here. It stopped, because there is a curb. And here it goes inside, and there is going to be another person walking with a stroller, and you see that the green carpet doesn't go over the person.
[END VIDEO PLAYBACK]
So altogether, you can think of it as having an input, the image. You have flow. If we have more than one camera, there is also depth. There are some coupling constraints from graphical models, and there's a deep network with many outputs. One is the pixel labeling. Another is the path planning, and then the objects, vehicles, traffic signs, pavement markings, and so forth. So it's a huge monster of a learning machine going on there.
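As a rough sketch of the "one learning machine, many outputs" idea just described, here is a toy multi-head network: a shared convolutional trunk feeding separate heads for pixel labels, a driving path, and coarse object outputs. The layer sizes and head definitions are illustrative assumptions, not the actual Mobileye network.

```python
import torch
import torch.nn as nn

class MultiTaskPerception(nn.Module):
    """Toy shared-trunk network with several output heads, loosely mirroring
    one deep network that produces pixel labels, a driving path, and object
    detections. All sizes are illustrative, not Mobileye's."""
    def __init__(self, num_classes=8, path_points=10):
        super().__init__()
        self.trunk = nn.Sequential(                      # shared convolutional features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pixel_head = nn.Conv2d(64, num_classes, 1)  # per-pixel semantic labels
        self.path_head = nn.Sequential(                  # lateral offsets of the path at fixed ranges
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, path_points))
        self.box_head = nn.Conv2d(64, 5, 1)              # crude per-cell box + objectness score

    def forward(self, image):
        features = self.trunk(image)
        return self.pixel_head(features), self.path_head(features), self.box_head(features)

# Example: one 128x256 RGB frame produces all three outputs in a single pass.
model = MultiTaskPerception()
pixels, path, boxes = model(torch.randn(1, 3, 128, 256))
```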
All of this is running on a specially designed microprocessor. This is the current generation, the EyeQ3, which was launched in October last year. The first vehicle platform it was launched on was a Tesla. And a few weeks ago, it was on an Audi. And during this year, there are going to be another nine launches, with eight different car manufacturers, of the EyeQ3.
In terms of the design of this architecture, it's a combination of CPUs and accelerators. It has eight cores: four CPUs for general-purpose code, and special vector accelerators that are dedicated to computer vision. They are more efficient than GPUs and DSPs for computer vision. So it's not a general purpose chip; it will not be a chip used for image coding and decoding, it will not be used for a laptop. It would be used only for computer vision.
Each of these cores has 64 multiply-accumulates per cycle, running at half a gigahertz. So if you multiply all of this together, this is a quarter of a teraflop of processing power, but with very high utilization. The utilization for a convolutional net is around 0.9, which is 3 times more than what you get with a GPU, for example.
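A back-of-the-envelope check of those figures, assuming the 64 multiply-accumulates per cycle refer to the four vector cores and counting each multiply-accumulate as two floating-point operations:

```python
# Rough sanity check of the quoted numbers; the core count attributed to the
# vector accelerators is an assumption based on the talk's "eight cores, four CPUs".
vector_cores = 4
macs_per_cycle = 64
clock_hz = 0.5e9            # "half a gigahertz"
flops_per_mac = 2           # one multiply plus one accumulate

raw_flops = vector_cores * macs_per_cycle * clock_hz * flops_per_mac
print(raw_flops / 1e12)     # ~0.256, i.e. roughly a quarter teraflop

effective_flops = raw_flops * 0.9   # quoted ~0.9 utilization on convolutional nets
```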
This is the next generation chip, coming out this year, in the fourth quarter. And this has multiple different vector accelerators. These are the same as those vector accelerators, but there are two additional kinds that cover the complete spectrum of flexibility and power, so that you can reach a utilization close to 1.0 on all computer vision algorithms. And if you just calculate the raw power, it's more than 2.5 teraflops of computing power, without counting the CPUs. So this is very, very powerful.
A chip like this is designed to start launching in 2018. It can connect eight cameras and do computer vision simultaneously with eight cameras, three of them in the front. The reason you need three cameras in the front for automated driving is that you want a very wide field of view, just like a human driver, who sees about 180 degrees. But if you have only one camera with a fisheye lens, then you don't have range.
So you need more than one camera. One is a fisheye, another is medium range, another is a narrow field of view, such that together you can see up to 300 meters, and also 180 degrees. So three cameras in front, then four surround cameras and another narrow rear camera. So altogether you have eight cameras, and together also with radars and lidars, all feeding into one chip.
And the way this autonomous driving is moving forward is: this year, three months from now, you are going to have a hybrid capability of autonomous driving, where, if you're on a highway, you simply let go of the steering wheel; the car will maintain its path, and also change lanes whenever necessary, but only in a highway setting. Then within 2017, '18, these systems will move into rural and city driving, but under the assumption that the driver is behind the steering wheel. And the driver has a grace period somewhere between 10 and 20 seconds. When the system thinks that the driver needs to take back control, there are still 20 seconds to wake up the driver and have them take control. So it's really around the corner. We're talking about starting now; by 2018 you can have a system which can do 90% of the driving experience.
OK, so this was Mobileye. And when you think about this level of computer vision, it is really the incarnation of artificial intelligence. If we thought that the first incarnation of artificial intelligence would be in robots, humanoid robots, Asimov-type robots, we know today that the first incarnation of this artificial intelligence is really cars. Every car with these kinds of systems has the ability to interpret images at a very, very high level. Some of these functionalities even exceed human-level recognition. The ability to detect pedestrians here is better than humans'. We know that, because we tested and validated it: we take all the errors of these systems and show them to humans, and they cannot do better.
And as we go forward, to really support complete autonomous driving, you're talking about superhuman capabilities. And again, this is around the corner. We're not talking about science fiction. So this is why this kind of technology changes our world. Why does it change our world? Because imagine the driverless car as a potential transformation of the way we drive. If you have cars that are accident free, then perhaps you don't need all this passive safety that adds a lot of weight to the car. You can simplify the manufacturing, and then the design, of the car.
It could also change completely the way we own cars. You can have an Uber type of application, in which a driverless car comes to where you are, takes you to where you want to go, and there could be many, many of these cars. So you don't need even to own a car. And again, we're talking about things that are not science fiction. Somewhere between now and 10 years from now, these kinds of things will be around us.
And computer vision is the primary source of all of this. [INAUDIBLE] is the primary source. There are radars, there are lidars; all of these things are useful, but they cannot cover the spectrum of what a camera can do. A radar is good at certain things and very bad at other things, and a laser scanner is good at some things and very bad at other things. A camera can be good at everything. Just like we humans, with our eyes, can negotiate the visual world.
And what is unique about the camera is that it is very, very low cost. It's a few dollars. So if we have a car with many cameras around it, the cost would still be very, very low. In order for these kinds of technologies to be on every car, they cannot cost more than a few hundred dollars. If they cost $10,000, they will never be mass produced. If they cost a few hundred dollars, then they will be on every car. Therefore, it has to be based on cameras. First of all, the camera can do it. Second, the cost is a facilitator, an enabler, for having this on every car. So this is why it is changing our world.
Let me now change gears and go to the second area where computer vision has the potential to change our world, and this is cameras on us. So now imagine we carry a camera, and the camera has human-level ability to understand the visual field. What can we do with it? Now, I don't need a camera to interpret the visual field, and I guess you don't either.
So let's take it in steps. The first step would be: let's find a niche of society for whom, if they had a camera on them, and the camera had human-level capabilities of understanding the visual field, it would be useful. So the first people that come to mind are the blind.
The problem with the blind is that it's a very, very small niche. For example, in the US there are about 1 and 1/2 million blind people, and their requirements are very, very complex. They also need not to hit objects; not only to understand what is in the visual field, but also to prevent collisions with objects. It's very, very complex.
But then there's another niche, which is much bigger, more than an order of magnitude bigger, and these are the visually impaired. In the US, there are about 25 million visually impaired people. So now we're talking about a significant niche of people. And these are the part of society for whom corrective lenses cannot correct their disability. It could be macular degeneration, it could be age related, all sorts of things that limit their ability to handle the visual world. They cannot read anymore, they cannot negotiate the outdoors, daily activities.
And this segment of society doesn't have real technology to help them. People with a hearing disability have good technology to help them. We can amplify certain frequencies, and all of a sudden they hear. But if you are visually impaired, there's nothing you can do.
So if you had a camera on you, and the camera had, say, an earpiece talking to you, and the camera were intelligent enough to understand what kind of information you are looking for in the scene, and tell you about that, and intelligent enough to know when you are looking for information-- because you don't want the camera talking to you all the time, when you don't need to be talked to-- then it could be very, very useful. So now let's try to imagine these things.
Let's assume that you are standing at a bus stop, and the bus is coming. You are visually impaired. You know that the bus is coming. You see a silhouette, you hear it, but you don't know what the bus number is. Now let's say that the way we interact with the system is that we point, because the camera can see our finger. So we point, and the camera now knows that you want information. The camera has object recognition capability and detects a bus. It knows that if it detects a bus, the information you want to know is not that it has detected a bus-- you know that there is a bus there-- it is the bus number. So it will read you the bus number.
Let's assume you want to cross the street, and you know that there's a traffic light there, but you don't know what the color of the traffic light is, whether it's red or green. So you'll point again, the camera does object recognition, sees a traffic light, and knows that in the context of seeing a traffic light, the information you want to know is the color of the traffic light.
Let's say you hold a $100 bill, or, say, $10 bill. You point at it. The system has object recognition capability, understands that you are looking at a money note, and will tell you it's a $10 bill.
Let's assume you are opening a newspaper. The camera knows that it's a newspaper. You are pointing someplace on the newspaper, so the camera will now do a layout analysis, and read to you the article closest to your finger.
Let's assume you are pointing at a familiar product. The camera has object recognition, and also instance based recognition capabilities. It'll tell you what's the name of the product.
Let's assume that you are looking at a familiar face. Then you don't need to point at the face, it will do this automatically. Whenever this face is in the field of view, it will tell you the name of the person. How do you teach it a new face? You simply look at the person, you point to him, finger and face detection initiates a learning phase. And the system will ask you, what's the name of the person. You'll say the name of the person. Next time the person appears, it will tell you who it is.
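The interaction model in these examples is essentially: a pointing gesture triggers recognition, and the recognized class determines which piece of contextual information is spoken. Here is a minimal sketch of that dispatch logic; the recognizer output fields and the function itself are hypothetical placeholders, not OrCam's API.

```python
# Hypothetical glue code illustrating the point-then-answer interaction.
def describe(detection: dict) -> str:
    """Map a recognized object class to the information the user actually
    wants, as in the bus / traffic light / bank note / face examples above."""
    kind = detection.get("class")
    if kind == "bus":
        return f"Bus number {detection['line_number']}"
    if kind == "traffic_light":
        return f"The light is {detection['color']}"
    if kind == "bank_note":
        return f"This is a {detection['denomination']} dollar bill"
    if kind == "face" and detection.get("name"):
        return detection["name"]
    return "Text detected" if kind == "text" else "Unknown object"

# Example: the camera saw a pointing finger near a detected bus.
print(describe({"class": "bus", "line_number": "72"}))   # -> "Bus number 72"
```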
So you can now extrapolate, more and more, what a camera can do for someone who is visually impaired. And this is what we have been building. This is how it looks. It's a clip-on camera, so it clips onto existing eyeglasses. And this is the computing device; there is a cable. The computing device is the size of your smartphone, and it sits in your pocket. And the way you interact with the system is with finger pointing.
And as I said, it can learn both faces and objects. It will also read text in the wild: if there is a street sign, I point at it, and it'll tell me what's written there. It recognizes places, faces, and objects. And there are many more things on the roadmap that something like this can do. So let me-- one moment.
So let me show you a clip that first describes this. So this is a clip by Liat. She is visually impaired from birth, and she works at Orcam. OK, so this is an advertisement. But then later I'll show you clips from real users. But this is to give you an idea.
-Hi, I am Liat, and I am visually impaired. I want to show you today how this device changed my life.
-Great. Let's go there.
-Red light. Green light.
-50 shekel. Let's buy some coffee.
-Breakfast. Bagel plus coffee with cream cheese, croissant, yogurt, cream with fresh fruit.
The Fresh Paint Contemporary Art Fair began six--
[END VIDEO PLAYBACK]
AMNON SHASHUA: OK, I'll skip this. This is the way you teach--
-Start object learning mode.
[END VIDEO PLAYBACK]
AMNON SHASHUA: OK, I'll skip this.
So we started in 2010. For three years we developed the capability, hardware, and software. And in June 2013, we had John Markoff from The New York Times-- he's a science reporter at The New York Times-- come for a visit, and he wrote a very nice article about OrCam. So we decided that at the time the article appeared, we would launch the website of the company.
Until that point, we didn't have any website; we were working in stealth mode. And that would be an opportunity to build a user base, a real user base, not people that we pay money to test our device, but real customers. And when we launched the website, we wrote that the price of the device is $2,500. And we said the first 100 people who buy the device will get it by September. That was June. Within an hour, those 100 devices were sold out. And then we kept a waiting list. We have more than 25,000 people on the waiting list right now.
So those 100 people did not get the device in September. They got it around January. And throughout 2014, we were working with them to get feedback, improve the device, better understand how real visually impaired people interact with such a device.
One of the things that we learned is that it's not a device you can send by mail with an instructional video. You need to have hands-on training to show how the device works, how you point, and so forth.
The second thing that we learned, which was kind of surprising to me, was the following. Now, with Mobileye, the technology must be perfect. You cannot have false braking. You are driving leisurely, and all of a sudden your car thinks that there is a pedestrian in front-- there's nothing there in front-- and all of a sudden it applies the brakes. That's a catastrophe. It has to work perfectly. But I thought that in the context of OrCam, since these people don't have an alternative, if the system sometimes doesn't work well, it's OK.
It turns out that I was wrong. People find ways to compensate for their disabilities. So, for example, for reading, they take the text and put it one or two centimeters before their eyes, and they read letter by letter. Sounds awkward, but you get used to it after a while.
So they have ways to compensate for their disabilities. They will change their ways, they'll move to a new technology, only if this new technology works consistently, always. If it sometimes works and sometimes does not work, they'll revert back to their old habits.
So we learned that during 2014. For example, there were issues with low light that we had to improve, and we also replaced the camera with a more sensitive camera. And this process ended around two weeks ago. Right now we are upgrading and adding 50 more new users, then we'll add another 50, and by the summer of this year we are going to launch this device. So let me show you some examples from real users.
So I'll tell you three examples, and these are really illuminating examples. The first example is Marcia. She's from Brazil. So the 100 users are only American, only US. The system only speaks and reads English. Later, we'll add more languages, but at the moment it's only English.
Marcia, she's from Brazil, so she didn't agree to accept no for an answer. She took a plane and came to OrCam. We were so impressed that we said it's not 100, it's 100-ish, and we added her.
So while she was being trained, somebody took an iPhone and shot a video. What's interesting about this two minute clip: first is the body language. She's Brazilian, and you can see a lot from the body language, what the system does for her. Second, she also explains how she copes with her visual impairment, especially how she distinguishes between different money notes. So let me run this.
AMNON SHASHUA: The system is reading a newspaper. That's not important.
-This is fantastic! Fantastic!
-$50! Cinquenta dollars!
-Cinquenta. Let's see if you get better. Yes?
-Green, all green. And I put mark color. Yellow, green, orange-- different note.
This is why it's-- Again, again, again.
-$20. [SPEAKING PORTUGUESE] It's not [INAUDIBLE].
[END VIDEO PLAYBACK]
AMNON SHASHUA: OK, I can show the next clip. Next clip is Debbie, again, one of the 100. She appeared in AIPAC last year, in front of 14,000 delegates, and she gave a 14 minute show. Brian was the host. And she explained how the system worked, she demonstrated it, and so forth. I'm not going to show you all 14 minutes. But the last minute, the host asks her about the impact of such a device. And she answers quite nicely. Let me show you.
-For an example, I was invited to a restaurant by a friend. And usually when we sit down, we would be presented with the menu. My friend would then read the menu, place her order, and then she would read the menu to me. But Brian, this time was very different. I had Orcam with me, and I was able to read the menu myself.
I was able to place my order. I was able to-- it was just so fascinating. I was able to continue my conversation with my friend, without my friend being focused on my disability. For the first time since losing my sight, I was able to feel like a normal person.
[END VIDEO PLAYBACK]
AMNON SHASHUA: The last one is the following. Over the year, we had many requests from research teams to use the device as part of their research with the visually impaired. And we resisted, because we knew the device wasn't mature enough. But two months ago, we started releasing it to one of these research groups. So this is the abstract of a paper that they wrote.
What they did was take nine devices, give them to nine visually impaired people, and let them use them for a month. And after a month, they interviewed them in order to understand the change in quality of life. Eight out of nine reported a significant change in quality of life. And then they sent us the interviews.
So the next interview is interesting, because during the interview the interviewer tells the user that the device costs a lot of money. It's one thing to say, oh, it's a great device, when you get it for free. But if you need to pay money, then OK, maybe it's not that great.
So he told her it's going to cost a lot of money. He told her $2,000. It's going to cost more, but never mind, even $2,000 is expensive. And her answer is very, very illuminating. So let me show this.
-The first few days I had the Orcam, I was in total awe of it because, for the first time I was able to open mail and read it, instead of having my husband read my mail. And I was able to go to a restaurant and actually read the menu, and order myself with the waitress. And that was exciting. When you can't do something for such a long period of time, the Orcam was incredible.
- --I believe is what the estimate is. Do you think such a high price would be something people would be willing to pay for a device like this? Do you think it's marginally worth it, right now?
-I think you're going to find that that's going to be on a case by case basis. People who have money, there's certainly no problem, $2,000. I don't have money. I am low income. But I would save my money, scrape it together, in order to get it at $2,000.
[END VIDEO PLAYBACK]
AMNON SHASHUA: OK. So it means that we are on the right track. So where is this going? There is a very interesting roadmap that one can apply incrementally, in order to make the system understand more and more details about the visual field. One is to do language understanding.
So once text is presented to the camera, within a fraction of a second it reads all the text, which is done today, and then it does a text analysis-- for example, looking for certain keywords. For example, if you see the keyword "amount due", you know you are looking at a bill, say an electricity bill or a telephone bill. So if the user doesn't point at anything specific, just tell him: this is a bill, and this is the amount due.
Understanding, for example, that it is a menu, because many of the words are food related. So you know it is a menu. So when the person points, the system not only reads the line, but looks for the number that comes later, which is the price of the food item. So, better text understanding: not only OCR level, but understanding the type of text.
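A toy sketch of that keyword-driven document-type idea; the keyword lists and threshold are illustrative assumptions, not OrCam's actual rules.

```python
def classify_document(ocr_text: str) -> str:
    """Guess a document type from OCR output using simple keyword cues,
    as in the 'amount due means a bill' example above."""
    text = ocr_text.lower()
    if "amount due" in text:
        return "bill"
    food_words = {"salad", "pasta", "coffee", "dessert", "appetizer"}
    if sum(word in text for word in food_words) >= 2:
        return "menu"
    return "unknown"

print(classify_document("Grilled salmon 24.00  Pasta primavera 18.00  Coffee 3.50"))
# -> "menu" (two or more food-related words were found)
```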
Chat mode. Chat mode is-- assume you are out there, you are outdoors, you have lost orientation completely. You want the system, tell me what I see. Every frame, every second, tell me what I see. I see a tree, I see a chair, I see people. Tell me what I see.
So now, if you have a few thousand categories, imagine a kind of ImageNet-- those are categories in computer vision-- imagine a type of image annotation where, given an image, it tells me a story about that image. This is something that is at the cutting edge of research today. This is something that can be done, and will be done, starting this summer, with this type of device. So gradually we allow very, very high-level and sophisticated computer vision to cater to a certain niche of people. And as I said, this is quite a large niche. In the US alone, it's 25 million people.
Where is this going? Further than that. I believe that everyone will have such a device. But what's the value of such a device for people with normal sight? So we're not talking about a device which sits on eyeglasses. First, 50% of the population don't wear eyeglasses, and the other 50% that do wear eyeglasses, I guess, wouldn't like to look weird with a camera on their eyeglasses. So it's supposed to be somewhere concealed.
Imagine a button camera, something of this size here, which does continuous computer vision, and gradually provides you value by sending critical information to your smartphone, and from your smartphone to your website: about people that you have met, places that you have been, how much time you spend watching TV during the day, how much time you spend in your office, who are the people you have seen during the day, and so forth. It builds your day in a very detailed manner, for life-logging purposes. And then there are other applications that you can put on top of it.
This is a device that we are building now. We've already finished all the hardware phases, and are now starting the software phases. I believe that somewhere around a year from now, we'll be able to launch something like this for normally sighted people. So we're talking about wearable computing, but really wearable computing, not a smart watch which only shows you text messages and emails. We're talking about something that does serious computing, continuous computer vision, all the time, throughout the day. You need to charge it only once a day, which is a challenge. People don't want to charge too many devices. But again, if it provides significant value, you'd be willing to charge a device every day.
And this is where I think wearable devices are going: sensors that are sophisticated enough to understand the visual world at the level that we understand it, and to be a companion to us. Collect information that we miss. We have eyes, but we are not attentive all the time. Collect information that we miss, and provide that information when we need it, because it's always, always there. And again, this is not science fiction. This, I believe, we'll be launching a year from now, and it will have gradual and incremental growth of continuous, computer vision related capability.
So this is where I feel the future of wearable computing lies, sophisticated sensors that are on us, that do real computing, not just displaying text messages. That's it. I think an hour passed.
TOMASO POGGIO: Questions? Danny.
AUDIENCE: I want to ask about Mobileye [INAUDIBLE]. So I think one of the main concerns with this kind of technology is for humans, the processing is done in such a way that when there is something that is exceptional from the standard rule, then usually the decision is a [INAUDIBLE] decision. The brain [INAUDIBLE] for [INAUDIBLE]. But in this kind of technology, who's really--
If the system is based on training sets of what is standard, how will it perform when exceptions are introduced? So, in other words, how similarly do you think the system thinks to a cognitive brain?
AMNON SHASHUA: OK, I'll rephrase your question. These exceptions-- let's assume we are looking at pattern recognition, and you have trained the system on a library of pictures of cars, and all of a sudden there is a weird looking car. Would you miss it, or not? The answer is no, you will not miss it. These learning algorithms generalize very, very well. I'm telling you, with pedestrians, and also with vehicles, we have surpassed human-level capabilities.
But your question is relevant in another area. In decision making, that could be problematic. Say, for example, you have the ability to turn away from a collision using the steering wheel, to turn away. Today, these systems, they only brake. But if you have control of the steering wheel, you can decide that you can escape the collision, rather than braking. Or you know that if you brake, you will still hit the target, but if you steer away, you can-- but if you steer away, there is now a child there.
So now you have a decision: I'm going to hit a car, or I'm going to hit a child. Which one is more important? And this is one of those things that you don't want software to decide for you. And this is an issue that no one yet has an answer to. I believe that in the end it will be resolved.
My forecast for something like this is that the way to resolve it is that you tie your hands. Even though you can steer away from an accident, a robotic system will not steer away from an accident; it will only brake, like today's systems, so then you don't have this conundrum. You only brake. You avoid an imminent collision, and even if you cannot avoid it, the only thing that you do is brake.
So there are lots of ethical issues around autonomous vehicles that are yet unsolved. But the belief of the industry is that you start introducing the technologies, and then all the issues will get resolved on their own; but first, start introducing the technologies.
AUDIENCE: I'd love to hear more about-- I'd love to hear more about chat mode. How do you build a visual narrative?
AMNON SHASHUA: Again?
AUDIENCE: At the end of your talk, you mentioned--
AMNON SHASHUA: Image to story.
AUDIENCE: Yes. How do you see building [INAUDIBLE]?
AMNON SHASHUA: First, there are a number of very nice academic papers in this area, from Stanford, from [INAUDIBLE], from Google Research, from Lior Wolf at Tel Aviv University. They train both an image network and a language model, and they combine the two through a recurrent network, such that when you get an image, you match it to a number of keywords that tell you not only what the objects are, but give a kind of narrative around them. And you use one combined deep network to go from images to stories. So it has been done in academic research. And this is something that is in the works, also, on the OrCam device. Yes.
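Here is a minimal sketch of the encoder-decoder idea he describes, a convolutional image encoder whose summary vector seeds a recurrent language model. The layer sizes, vocabulary, and architecture details are illustrative assumptions, not any specific published model.

```python
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    """Illustrative CNN encoder + RNN decoder, in the spirit of the
    image-to-story work mentioned above; not any specific published model."""
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, hidden))
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image, caption_tokens):
        h0 = self.encoder(image).unsqueeze(0)   # image summary seeds the RNN state
        x = self.embed(caption_tokens)          # previous words as input
        y, _ = self.rnn(x, h0)
        return self.out(y)                      # next-word scores at each step

# Example: one 64x64 image and a 7-token partial caption.
model = ToyCaptioner()
logits = model(torch.randn(1, 3, 64, 64), torch.randint(0, 1000, (1, 7)))
```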
AUDIENCE: Excuse me. I have a two part question. The first is about liability. It seems like even though accidents may be less frequent, or about the same, they're going to happen, maybe in a case where a person could have avoided it, as opposed to a case where a person would-- I don't know, some case. Who's going to take the liability for that? I'm sure that's an issue.
Second question, kind of unrelated followup on this other one. When there's a policeman waving people to go through the red light and go around this way, or you make eye contact, the driver makes eye contact with someone saying, wait. I assume that kind of thing is very difficult for your system to handle.
AMNON SHASHUA: OK. So let's talk first about liability. Today, with the systems that avoid collisions, there isn't much of an issue of liability, because the driver is in the loop. The driver is supposed to take responsibility for a collision. The system is only helping, not taking responsibility for a collision, so there isn't an issue of liability. Of course, if you have a bug in the system, and there is a recall, then there is an issue of liability.
When you're talking about autonomous driving, then it is a real issue, because you are not in the driving loop. And let's assume that an accident happened, and you can prove, even though statistically a robotic system drives better than a human, you can prove for this particular accident, if a human was in the loop, the human would have avoided the accident.
So I'm not sure that all of you are aware, but airbags also kill people. We're talking about minor accidents. Let's say you hit a curb; without an airbag you would have escaped this accident without any bodily injury, and the airbag kills you. There are about 80 deaths per year from airbags. You don't know about it, right? Because airbags save lives. So now it's a concerted effort between regulators, insurance, and so forth.
Because this technology saves lives, the fact that it also kills people is somehow managed. That means of course, people get compensation and so forth, but the pool of money for all this compensation is handled. This is why you don't hear about it. So I believe that if automated-- and here is just a conjecture-- if automated driving, statistically, is much, much better than human driving, these types of accidents that you could prove that a robotic system-- it will be handled through insurance, through-- because if for society this thing is a good thing, it will be resolved.
AUDIENCE: So, a very specific thing, in terms of the self-driving car. Does the current system use object permanence? If you see a pedestrian walk behind an SUV, a parked SUV, does it anticipate the pedestrian walking out, coming out the other side?
AMNON SHASHUA: Currently not, but this is part of the road map for supporting automated driving. And I didn't answer your second question there, these images of a policeman waving, and so forth. Currently the systems are not doing it, but I don't see a major hurdle in recognizing the action of a person. We are getting there. We are now recognizing the pose of the person. Is the person looking at us, is their back turned to us? Where is the person, is it on the road, on the pavement? Knowing whether the person is signaling or not is a natural growth of this kind of understanding, what the person is doing. I don't think it's a major impediment.
I think a bigger impediment to completely autonomous driving is junction negotiation, taking a left turn at a junction. It's a problem, because humans bend the rules. They don't follow the rules, because if you follow the rules, you will simply get stuck there. Nobody will let you in. But now imagine a robotic system bending the rules. So I think that is the bigger thing to think about, negotiating junctions. But a policeman waving his hand, and so forth, is not a big problem.
AUDIENCE: Hi, you mentioned that an Orcam can learn. Have any of your beta testers found that capability useful?
AMNON SHASHUA: OK, so I need to qualify what it means to learn. In terms of recognition, there are two types of recognition. There is class-based recognition, knowing that this is money, this is a bus, or this is a traffic light. It's a class of objects. And there is instance-based recognition, which is simply doing texture matching. For example, knowing that this is a 100 note: you recognize this not as a class. You have a picture of the 100 note in your database, and you do image-to-image matching. Of course, you don't do raw image-to-image matching; there's a certain representation of the image, and you match the representations.
This is what I mean by learning. Say, for example, I want the system to recognize this object not by reading, but by looking at the entire texture, and recognizing that this is an Expo whiteboard care product. So what I would do is take this and wave it at the camera. Waving signals to the camera that this is a learning phase.
What this waving process does is the camera puts a bounding box around the object, using motion. Then it asks me what it is. I say whatever I say, and it records my voice snippet. It takes an image representation, adds it to the library, and then it is part of the objects that it recognizes. Next time I hold it and point at it, it will not start reading the text; it will find a match in the library, and repeat what I said when I taught the system the object. This is what I mean by teaching an object.
The same thing with faces. You can teach the system a new face by pointing at the face. The system continuously does face detection. When it detects a finger combined with a face, it understands that you want to teach it, and asks you the name of the person. It will not store the image; it will store a representation of the image in a database of faces. Then any time the system detects a face, it matches the representation of that face against the database. If it finds a match, it says the name. If it doesn't find a match, it says nothing.
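A similar loop can be sketched for faces. Again, this is only an illustration under assumed interfaces: the detector, embedder, gesture check, and speech callbacks are hypothetical parameters, not OrCam's actual API.

```python
import numpy as np

def best_match(embedding, face_db, min_similarity=0.8):
    """Return the enrolled name whose stored embedding is most similar, or None."""
    best_name, best_sim = None, min_similarity
    for name, stored in face_db.items():
        sim = float(np.dot(embedding, stored) /
                    (np.linalg.norm(embedding) * np.linalg.norm(stored) + 1e-8))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

def on_frame(frame, face_detector, embed_face, finger_detected, ask_name, speak, face_db):
    """One step of the continuous loop: detect faces, enroll on a pointing gesture,
    otherwise match against the database and say the name only on a confident match."""
    for face_crop in face_detector(frame):
        emb = embed_face(face_crop)        # a representation is stored, never the raw image
        if finger_detected(frame):
            face_db[ask_name()] = emb      # enrollment: the user speaks the person's name
        else:
            name = best_match(emb, face_db)
            if name is not None:
                speak(name)                # silence when there is no match
```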
AUDIENCE: So it looks like your systems are able to solve very complex tasks, and do so very, very fast. Is it because you have much better hardware, or is everybody else in computer vision using models that aren't complex?
AMNON SHASHUA: Well, it's a combination of both. Mobileye has a specialized chip, a system on chip. The challenge with hardware in automotive is three things. One, you need a lot of computing power to do computer vision; this goes without saying.
Second, you need it to run at very, very low power consumption. Low power consumption here means around 2.5 watts. Just to give you a sense of scale, the Core i7 that I have here in my MacBook Air is about 60 to 70 watts. So we're talking about more than an order of magnitude of difference. You want very high computing capability at very, very low power.
Third, you want it to be very, very low cost. Very low cost means single digits in dollars. So these are three contradictory requirements. We designed a chip which satisfies all three, and its computing power is very, very significant, with very high utilization for computer vision.
Second, the algorithms you design also need to be tailored for real-time processing. Take, for example, convolutional nets, and look at the academic papers you find on them. Take AlexNet, the 2012 Hinton paper. It's an ImageNet network: it takes roughly a 200 by 200 image and gives you a label, one of 1,000 categories. It has about 832 million multiply-accumulates to run through. Just to give you a sense of scale, that would take about six seconds on a Tegra X1 chip, and more or less the same time on an EyeQ3 chip. So six seconds for one frame. And it has 60 million parameters.
Let's take DeepFace from Facebook and Lior Wolf. It's a network that takes a 100 by 100 image of a face and does face recognition. It has 120 million parameters and hundreds of millions of multiply-accumulates. So this definitely cannot run in real time. We're talking about processors that run at 30 frames per second, and do things that are even more challenging than image-to-category. For example, there's pixel labeling, there's the green carpet that I have shown.
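As a back-of-the-envelope illustration of why such networks strain a 30-frames-per-second budget, here is a hedged latency estimate. The sustained-throughput figure is an assumption made up for illustration, not a specification of any chip mentioned above.

```python
# Back-of-the-envelope latency estimate for a convolutional net on an embedded chip.
# The sustained throughput below is an illustrative assumption, not a measured spec.

def frame_latency_s(macs_per_frame, sustained_mac_per_s):
    """Forward-pass time, given the throughput the chip actually sustains on this
    workload (peak throughput times utilization)."""
    return macs_per_frame / sustained_mac_per_s

FRAME_BUDGET_S = 1.0 / 30            # ~33 ms per frame at 30 fps

alexnet_macs = 832e6                 # multiply-accumulate count quoted in the talk
assumed_sustained = 5e9              # assume 5 GMAC/s sustained on this workload
t = frame_latency_s(alexnet_macs, assumed_sustained)
print(f"{t * 1e3:.0f} ms per frame vs a {FRAME_BUDGET_S * 1e3:.0f} ms budget")
# -> about 166 ms per frame, already 5x over the 30 fps budget under this assumption
```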
So it also requires designing your algorithms differently, designing networks that are very, very efficient, and leveraging the fact that we have much more data than anything being run in academic research. Normally in academic research the amount of data is not enough, so you are in an overfitting situation, and you use regularization to control the sample complexity.
At Mobileye, it's the other way around. We are never overfitting. We have so much data that it's a completely different problem to solve, and we design much, much more efficient networks. This network for the green carpet, the semantic labeling, takes about 4% of the capacity of the EyeQ3 chip, the existing chip. It will take less than 1% of the EyeQ4 chip. And it's only a few million parameters, so it doesn't take much in terms of memory.
And memory is important in automotive. The size of the flash is very, very limited; the maximum flash size in automotive is 128 megabytes. That is nothing if you are talking about networks of 100 million parameters -- you would already fill 100 megabytes of flash. So our networks are only a few million parameters.
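The arithmetic behind that constraint is simple; the sketch below just spells it out, assuming one byte per weight for the 100-megabyte figure (an 8-bit storage assumption) and four bytes per weight for 32-bit floats.

```python
# Quick memory arithmetic behind the flash constraint (storage sizes are assumptions).
FLASH_MB = 128                           # automotive flash budget quoted in the talk

def weights_mb(params, bytes_per_weight):
    """Storage needed for the network weights, in megabytes."""
    return params * bytes_per_weight / 1e6

print(weights_mb(100e6, 1))   # 100.0 MB: 100M parameters stored at 1 byte each
print(weights_mb(100e6, 4))   # 400.0 MB: the same network stored as 32-bit floats
print(weights_mb(3e6, 1))     #   3.0 MB: a few-million-parameter network fits easily
```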
So it's a combination, designing software that fits real time processing, and designing hardware that has significant computing power at very low power consumption.
AUDIENCE: How would you compare the architecture of the EyeQ chip to a graphics processor?
AMNON SHASHUA: So I had here--
AUDIENCE: Nvidia, also.
AMNON SHASHUA: Let me see. I think I have here in one of the hidden slides-- let me go here. Maybe-- oh, yeah, I have it here. So this slide shows--
OK, so this axis is flexibility. The ultimate flexibility is a CPU -- a general-purpose RISC architecture. You can run any code. So this is maximal flexibility, but the performance is much lower than anything else. Because you have high flexibility, you pay the price in terms of performance. A DSP is somewhere here: it has much higher performance, but the flexibility is much, much lower. A GPU is somewhere in between.
This is our core in the EyeQ3. It has more performance for the same flexibility than a GPU -- about three times the ratio of performance to flexibility of a GPU. And in the EyeQ4, there are these two additional cores. You see this PMA has very, very low flexibility but huge performance; it's for a very, very specific type of computation, so it's more like an ASIC. And this MPC has flexibility very, very similar to a CPU, but eight times the performance.
So now if you take all three of them, you are basically spanning the entire spectrum of flexibility versus performance. So the key here is utilization. When you read about specs of chips, they all talk about gigaflops, teraflops, but they don't talk about utilization, because utilization is algorithm dependent. It's not a fixed number.
What we designed are cores that have very high utilization for the type of algorithms we run in computer vision, which are different from the type of algorithms you run in signal processing, which a DSP is designed for, and different from the type of algorithms, like coding and encoding, that a GPU is designed for in computer graphics.
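A toy calculation makes the utilization point concrete: what matters is peak throughput times the utilization you actually achieve on your workload, not the datasheet peak. All figures below are invented for illustration and do not describe any of the chips discussed here.

```python
# Illustrative comparison of effective throughput (all numbers are made up).

def effective_tflops(peak_tflops, utilization):
    """Useful work delivered: datasheet peak scaled by achieved utilization."""
    return peak_tflops * utilization

general_purpose = effective_tflops(peak_tflops=1.0, utilization=0.15)  # flexible, low utilization
specialized     = effective_tflops(peak_tflops=0.5, utilization=0.90)  # tailored to the algorithm

print(f"general-purpose core: {general_purpose:.2f} effective TFLOPS")
print(f"specialized core:     {specialized:.2f} effective TFLOPS")
# The chip that looks slower on paper delivers about 3x more useful work here.
```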
Computer vision has its own place in science. It's not signal processing. It's not computer graphics. It has its own place, and therefore it deserves its own architecture. And this is how we get the huge performance at low power consumption.
TOMASO POGGIO: OK, one last question.
AUDIENCE: Do these systems, especially OrCam, reach back to any servers to augment the processing resources?
AMNON SHASHUA: No. None of these systems have communication to a backbone, or to servers. OrCam could have benefited from something like this, but we noticed that since visual impairment is largely age related, most of our users are people who are technology averse. Having [INAUDIBLE] communication to the cloud, and to computers, and so forth, is nice for young people, but most of our customers would not make use of it. So with OrCam, it's all local, and it has to run in real time. If you start sending images to the cloud and waiting for them to come back, it will not be real time.
With Mobileye, right now there is no communication to the backbone, except with Tesla. Tesla can do upgrades over the air; only Tesla does that. So once you have all the hardware in place, you can incrementally add software features, just like we do today with smartphones when we update the firmware.
So for example, in October Tesla launched the first mono camera with us that did only traffic sign recognition and lane departure warning. Within a few months, they added more capabilities, like ACC, autonomous emergency braking, and high beam assist. They can add these over the air. But cars do not otherwise communicate with the backbone. This is something that will definitely happen in the course of the next 10 years: cars will communicate with the backbone and send data to it, and cars will communicate with other cars. All of this is in the roadmap of the industry.
TOMASO POGGIO: Very good. Let's thank Amnon.