Today we're joined by Stefan Lee, assistant professor at the school of electrical engineering and computer science at Oregon State University.
Stefan, who we sat down with at NeurIPS this past winter, is focused on the development of agents that can perceive their environment and communicate their understanding with humans in order to coordinate their actions to achieve mutual goals.
In our conversation, we focus on his paper
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, a model for learning joint representations of image content and natural language. We talk through the development and training process for this model, the adaptation of the training process to incorporate additional visual information to BERT models, where this research leads from the perspective of integration between visual and language tasks and finally, we discuss the importance of visual grounding.