Unifying Vision and Language Models with Mohit Bansal



About this Episode

Today we're joined by Mohit Bansal, Parker Professor and Director of the MURGe-Lab at UNC Chapel Hill. In our conversation with Mohit, we explore the concept of unification in AI models, highlighting the advantages of shared knowledge and efficiency. He addresses the challenges of evaluation in generative AI, including biases and spurious correlations. Mohit introduces models such as UDOP and VL-T5, which achieved state-of-the-art results on a variety of vision and language tasks while using fewer parameters. Finally, we discuss the importance of data efficiency, evaluating bias in models, and the future of multimodal models and explainability.


Thanks to our sponsor Qualcomm AI Research

Qualcomm AI Research is dedicated to advancing AI to make its core capabilities — perception, reasoning, and action — ubiquitous across devices. Their work makes it possible for billions of users around the world to have AI-enhanced experiences on devices powered by Qualcomm Technologies. To learn more about what Qualcomm Technologies is up to on the research front, visit twimlai.com/qualcomm.



One Response

  1. Respected Sir,
    Firstly, I want to express my gratitude for the invaluable knowledge I’ve gained from your insightful articles connecting vision and language. Your contributions have been immensely helpful in broadening my understanding of the subject.

    I am writing to seek guidance on a technical matter, which might seem somewhat basic. Specifically, I’m interested in exploring the application of convolutional vision transformers (CvT) to vision-language tasks. My research has led me to various pretrained models such as LXMERT, CLIP, BLIP, ALBEF, MAMO, MaskVLM, and VinVL, each offering distinct capabilities.

    My question concerns the feasibility of modifying these pretrained models by replacing their image encoders with a CvT. Essentially, I am curious whether it is plausible to alter the architecture of these models solely by integrating a CvT-based image encoder. Is such a modification feasible within the context of pretrained models? My PhD focuses on the efficiency of applying CvT to various vision-language tasks.

    Your insights or guidance on this matter would be immensely appreciated.

    Thank you in advance for any assistance or clarification you can provide. Your expertise and guidance would be invaluable in my endeavors.
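On the architectural question raised in the comment above, swapping image encoders is largely an interface problem: most vision-language models consume the image as a sequence of feature vectors, so any backbone that emits such a sequence can in principle be plugged in, followed by a linear projection to the dimension the language side expects. The sketch below is a minimal PyTorch illustration of that recipe; the class names are hypothetical, and the small conv-plus-transformer stage only stands in for a real CvT backbone (e.g. `CvtModel` in Hugging Face Transformers). Note that a pretrained model's cross-modal weights are tuned to its original encoder's feature distribution, so in practice such a swap usually requires further vision-language pretraining or finetuning rather than a drop-in replacement.

```python
import torch
import torch.nn as nn

class CvTStyleEncoder(nn.Module):
    """Stand-in for a convolutional vision transformer backbone.

    A real CvT replaces linear patch embedding with strided convolutions;
    here a single conv stage illustrates the interface: images in, a grid
    of feature tokens out.
    """
    def __init__(self, hidden_dim: int = 384):
        super().__init__()
        # Convolutional "patch" embedding: stride-4 conv downsamples the image.
        self.conv_embed = nn.Conv2d(3, hidden_dim, kernel_size=7, stride=4, padding=2)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W)
        feats = self.conv_embed(pixels)              # (B, C, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, N, C) token sequence
        return self.encoder(tokens)

class VisualAdapter(nn.Module):
    """Projects the new encoder's tokens to the embedding dimension
    a pretrained vision-language model expects for its visual inputs."""
    def __init__(self, encoder: nn.Module, in_dim: int = 384, out_dim: int = 768):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        return self.proj(self.encoder(pixels))       # (B, N, out_dim)

# A 64x64 image yields a 16x16 feature grid -> 256 visual tokens of width 768,
# ready to be concatenated with text embeddings in a VL model.
encoder = VisualAdapter(CvTStyleEncoder())
visual_tokens = encoder(torch.randn(2, 3, 64, 64))
print(visual_tokens.shape)  # torch.Size([2, 256, 768])
```

The projection layer is the only new parameter block the host model sees; everything downstream (cross-attention, fusion layers) is unchanged, which is why the feasibility question reduces to how much realignment training the pretrained weights need.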
