Finding Any Waldo

digitized Waldo in a human eye
September 14, 2018

Biologically inspired computational model can efficiently and invariantly search for natural objects in complex scenes

by Kris Brewer


Since 1987, people have been searching for the elusive red-and-white-clad “Waldo” in the intricately illustrated books of Martin Handford. This entertaining and challenging hunt for the bespectacled character illustrates a task that humans perform regularly in their daily routines. With a pan on the stove, we pull open a kitchen drawer and seek out a spatula to flip our pancakes. How are we able to efficiently find the utensil we need in that jumble of instruments before the meal burns?

Lead author and graduate student Mengmi Zhang of the Kreiman Lab at Boston Children’s Hospital and Harvard Medical School describes how humans achieve this feat of visual search quickly and repeatedly, and how her team used this biological process to inspire their computational model, in the recent paper “Finding any Waldo with zero-shot invariant and efficient visual search”, published Sept. 13, 2018, in Nature Communications. Humans are able to find a target selectively (distinguishing the target object from distractors), invariantly (finding the target irrespective of changes in appearance such as angle or rotation), efficiently (searching rapidly rather than exhaustively), and in a way that generalizes to novel objects (no training required).

In a series of psychophysics experiments, subjects performed three increasingly complex tasks, searching for a target object while eye tracking recorded their progress. In the first, an object-array task, the subject had to match the target object to one of six objects spread across the screen. Next, they had to find the target object within a natural scene. Finally, the subject was challenged to find Waldo in a scene from the books. Examples of the second and third experiments can be seen in the associated videos below, where the yellow dot represents the subject’s eye movements and the red box contains the target (ex. 2 – natural design & ex. 3 – find waldo).

Building on the behavioral data, the authors developed a computational model, the invariant visual search network (IVSN), to gain insight into the mechanisms that guide visual search without exhaustive sampling while still generalizing to new objects. The model extracts feature maps from the search image using a network resembling the ventral visual pathway. A representation of the target, stored in a module analogous to prefrontal cortex, then modulates those feature maps in a top-down fashion, producing an attention map that ranks candidate locations. In this way, the model transfers knowledge learned for object recognition in the ventral pathway to localizing the target in the visual search task with zero additional training.
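The core computation described above — top-down modulation of the scene’s feature maps by a target template, then picking fixations by winner-take-all with inhibition of return — can be sketched in a few lines. This is a minimal toy illustration, not the authors’ implementation: the arrays here stand in for CNN activations, and the function name and shapes are invented for this example.

```python
import numpy as np

def ivsn_search(target_feat, scene_feat, max_fixations=10, target_loc=None):
    """Toy zero-shot visual search.

    target_feat: (C,) channel summary of the target (the "prefrontal" template)
    scene_feat:  (C, H, W) feature maps of the search image ("ventral" output)
    Returns the sequence of fixated (row, col) locations.
    """
    # Top-down modulation: correlate the template with every spatial
    # location of the scene's feature maps to form an attention map.
    attn = np.tensordot(target_feat, scene_feat, axes=([0], [0]))  # (H, W)

    fixations = []
    for _ in range(max_fixations):
        # Winner-take-all: fixate the most attended location.
        y, x = np.unravel_index(np.argmax(attn), attn.shape)
        fixations.append((y, x))
        if target_loc is not None and (y, x) == target_loc:
            break  # target found
        attn[y, x] = -np.inf  # inhibition of return: never revisit
    return fixations

# Example: a 4-channel, 5x5 scene whose location (1, 2) matches the target.
rng = np.random.default_rng(0)
scene = rng.random((4, 5, 5)) * 0.1      # weak background clutter
template = np.array([1.0, 0.0, 1.0, 0.0])
scene[:, 1, 2] += template               # embed the target's features
print(ivsn_search(template, scene, target_loc=(1, 2)))  # prints [(1, 2)]
```

Because the template is compared against the whole feature map at once, no retraining is needed for a new target: swapping in a different template immediately redirects the attention map, which is the sense in which the search is zero-shot.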

Zhang concludes by pointing to two main highlights of this research. “According to our experimental results, we found that humans can perform invariant visual search selectively and efficiently. Different from object detection algorithms, which require intensive training data in computer vision, we propose a simple zero-shot biologically plausible computational model for invariant visual search. It provides a first order approximation to human visual search behavior.”

This work was done in collaboration with several research groups and institutes and was supported by the NSF-funded Center for Brains, Minds and Machines, NIH, and A*STAR.

Publication: Zhang, M., et al. “Finding any Waldo with zero-shot invariant and efficient visual search,” Nature Communications (2018).