|Title||Image interpretation by iterative bottom-up top- down processing|
|Publication Type||CBMM Memos|
|Year of Publication||2021|
|Authors||Ullman, S, Assif, L, Strugatski, A, Vatashsky, B-Z, Levi, H, Netanyahu, A, Yaari, AUri|
Scene understanding requires the extraction and representation of scene components, such as objects and their parts, people, and places, together with their individual properties, as well as relations and interactions between them. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (‘counter-streams’ structure). The BU- TD model extracts and recognizes scene constituents with their selected properties and relations, and uses them to describe and understand the image.
The scene representation is constructed by the iterative use of three components. The first model component is a bottom-up stream that extracts selected scene elements, properties and relations. The second component (‘cognitive augmentation’) augments the extracted visual representation based on relevant non-visual stored representations. It also provides input to the third component, the top-down stream, in the form of a TD instruction, instructing the model what task to perform next. The top-down stream then guides the BU visual stream to perform the selected task in the next cycle. During this
process, the visual representations extracted from the image can be combined with relevant non- visual representations, so that the final scene representation is based on both visual information extracted from the scene and relevant stored knowledge of the world.
The extraction process is shown to have favourable properties in terms of combinatorial generalization,
generalizing well to novel scene structures and new combinations of objects, properties and relations not seen during training. Finally, we compare the model with relevant aspects of the human vision, and suggest directions for using the BU-TD scheme for integrating visual and cognitive components in the process of scene understanding.
- CBMM Funded