This object-recognition dataset stumped the world’s best computer vision models

December 10, 2019

Objects are posed in varied positions and shot at odd angles to spur new AI techniques.

Computer vision models have learned to identify objects in photos so accurately that some now outperform humans on certain datasets. But when those same models are turned loose in the real world, their performance drops noticeably, raising reliability concerns for self-driving cars and other safety-critical systems that rely on machine vision.

In an effort to close this performance gap, a team of MIT and IBM researchers set out to create a very different kind of object-recognition dataset. It’s called ObjectNet, a play on ImageNet, the crowdsourced database of photos responsible for launching much of the modern boom in artificial intelligence. 

Unlike ImageNet, whose photos were pulled from Flickr and other social media sites, ObjectNet's photos were taken by paid freelancers. Objects are shown tipped on their side, shot at odd angles, and displayed in clutter-strewn rooms. When leading object-detection models were tested on ObjectNet, their accuracy fell from a high of 97 percent on ImageNet to just 50-55 percent.
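As a rough illustration of how a drop like that could be measured, the sketch below runs a pretrained ImageNet classifier over a local folder of test images and reports top-1 accuracy. It is a minimal sketch under stated assumptions: the folder path is hypothetical, the torchvision ResNet-50 stands in for "leading models," and the folder labels are assumed to have already been mapped to ImageNet class indices. It is not the official ObjectNet evaluation harness.

```python
# Minimal sketch: top-1 accuracy of a pretrained ImageNet model on a local
# folder of test images. The path and label mapping are illustrative assumptions.
import torch
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

dataset = ImageFolder("objectnet_images/", transform=preprocess)  # hypothetical local path
loader = DataLoader(dataset, batch_size=32, shuffle=False)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        # Assumes folder labels were already mapped to ImageNet class indices.
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"Top-1 accuracy: {correct / total:.1%}")
```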

“We created this dataset to tell people the object-recognition problem continues to be a hard problem,” says Boris Katz, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Center for Brains, Minds and Machines (CBMM).  “We need better, smarter algorithms.” Katz and his colleagues will present ObjectNet and their results at the Conference on Neural Information Processing Systems (NeurIPS).

Deep learning, the technique driving much of the recent progress in AI, uses layers of artificial "neurons" to find patterns in vast amounts of raw data. It learns to pick out, say, the chair in a photo after training on hundreds to thousands of examples. But even datasets with millions of images can’t show each object in all of its possible orientations and settings, creating problems when the models encounter these objects in real life.
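To make the "layers of artificial neurons" idea concrete, here is a small convolutional classifier in PyTorch. The layer sizes and class count are arbitrary choices for illustration only; this is not a model used in the study.

```python
import torch
import torch.nn as nn

# Illustrative only: a small stack of convolutional layers that maps an RGB
# image to scores over object classes. All sizes are arbitrary.
class TinyObjectClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: scores for one 64x64 image across 10 hypothetical classes.
print(TinyObjectClassifier()(torch.randn(1, 3, 64, 64)).shape)
```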

ObjectNet is different from conventional image datasets in another important way: it contains no training images. Most datasets are divided into data for training the models and data for testing their performance. But the training set often shares subtle similarities with the test set, in effect giving the models a sneak peek at the test.

At first glance, ImageNet, at 14 million images, seems enormous. But when its training set is excluded, it’s comparable in size to ObjectNet, at 50,000 photos. 

“If we want to know how well algorithms will perform in the real world, we should test them on images that are unbiased and that they’ve never seen before,” says study co-author Andrei Barbu, a research scientist at CSAIL and CBMM. 

A dataset that tries to capture the complexity of real-world objects 

Few people would think to share the photos from ObjectNet with their friends, and that’s the point. The researchers hired freelancers from Amazon Mechanical Turk to take photographs of hundreds of randomly posed household objects. Workers received photo assignments on an app, with animated instructions telling them how to orient the assigned object, what angle to shoot from, and whether to pose the object in the kitchen, bathroom, bedroom, or living room. 

The researchers wanted to eliminate three common biases: objects shown only head-on, objects in iconic positions, and objects in highly correlated settings, such as plates stacked in the kitchen.
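One way to decorrelate those factors from the object label is to draw each of them at random for every photo assignment. The sketch below shows that idea; the field names and value lists are assumptions for illustration, not the researchers' actual app.

```python
import random

# Illustrative sketch: rotation, viewpoint, and background are sampled
# independently of the object class, so none of them can be predicted
# from the label. Value lists and field names are assumptions.
ROTATIONS = ["upright", "on its side", "upside down"]
VIEWPOINTS = ["from above", "head-on", "from a low angle"]
BACKGROUNDS = ["kitchen", "bathroom", "bedroom", "living room"]

def make_assignment(object_name, rng=random):
    return {
        "object": object_name,
        "rotation": rng.choice(ROTATIONS),
        "viewpoint": rng.choice(VIEWPOINTS),
        "background": rng.choice(BACKGROUNDS),
    }

if __name__ == "__main__":
    for obj in ["chair", "plate", "teapot"]:
        print(make_assignment(obj))
```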

It took three years to conceive of the dataset and design an app that would standardize the data-gathering process. “Discovering how to gather data in a way that controls for various biases was incredibly tricky,” says study co-author David Mayo, a graduate student in MIT’s Department of Electrical Engineering and Computer Science. “We also had to run experiments to make sure our instructions were clear and that the workers knew exactly what was being asked of them...” 
