Transformer Module Networks for Systematic Generalization in Visual Question Answering

Title: Transformer Module Networks for Systematic Generalization in Visual Question Answering
Publication Type: CBMM Memos
Year of Publication: 2022
Authors: Yamada, M, D'Amario, V, Takemoto, K, Boix, X, Sasaki, T
Number: 121
Date Published: 02/2022
Abstract

Transformer-based models achieve great performance on Visual Question Answering (VQA). However, when we evaluate them on systematic generalization, i.e., handling novel combinations of known concepts, their performance degrades. Neural Module Networks (NMNs) are a promising approach for systematic generalization that consists of composing modules, i.e., neural networks that each tackle a sub-task. Inspired by Transformers and NMNs, we propose Transformer Module Network (TMN), a novel Transformer-based model for VQA that dynamically composes modules into a question-specific Transformer network. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, namely CLEVR-CoGenT, CLOSURE and GQA-SGL, in some cases improving by more than 30% over standard Transformers.

DSpace@MIT

https://hdl.handle.net/1721.1/139843

CBMM Memo No: 121

CBMM Relationship: 

  • CBMM Funded