Unifying Vision and Language Models with Mohit Bansal

EPISODE 636


About this Episode

Today we're joined by Mohit Bansal, Parker Professor and Director of the MURGe-Lab at UNC Chapel Hill. In our conversation with Mohit, we explore the concept of unification in AI models, highlighting the advantages of shared knowledge and efficiency. He addresses the challenges of evaluation in generative AI, including biases and spurious correlations. Mohit introduces groundbreaking models such as UDOP and VL-T5, which achieved state-of-the-art results across various vision and language tasks while using fewer parameters than comparable task-specific models. Finally, we discuss the importance of data efficiency, evaluating bias in models, and the future of multimodal models and explainability.


Thanks to our sponsor Qualcomm AI Research

Qualcomm AI Research is dedicated to advancing AI to make its core capabilities — perception, reasoning, and action — ubiquitous across devices. Their work makes it possible for billions of users around the world to have AI-enhanced experiences on devices powered by Qualcomm Technologies. To learn more about what Qualcomm Technologies is up to on the research front, visit twimlai.com/qualcomm.



One Response

  1. Respected Sir,
    Firstly, I want to express my gratitude for the invaluable knowledge I've gained from your insightful articles on the connection between vision and language. Your contributions have greatly broadened my understanding of the subject.

    I am writing to seek guidance on a technical matter, which may seem somewhat basic. Specifically, I am interested in applying convolutional vision transformers (CvT) to vision-language tasks. My research has led me to various pretrained models such as LXMERT, CLIP, BLIP, ALBEF, MAMO, MaskVLM, and VinVL, each offering distinct capabilities.

    My question concerns the feasibility of modifying these pretrained models by replacing their image encoders with a CvT. Essentially, is it plausible to alter the architecture of these models solely by integrating a CvT-based image encoder, and is such a modification feasible within the context of pretrained models? My PhD focuses on the efficiency of applying CvT to various vision-language tasks.

    Your insights or guidance on this matter would be immensely appreciated.

    Thank you in advance for any assistance or clarification you can provide. Your expertise and guidance would be invaluable in my endeavors.
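
On the question above: one way to prototype such a swap is to wrap a CvT backbone so it emits the kind of token sequence the rest of a vision-language model expects from its image encoder. The sketch below is a minimal, hypothetical illustration, assuming the Hugging Face transformers library and its microsoft/cvt-13 checkpoint; the target hidden size of 768 and the projection into a token sequence are assumptions for illustration, not the API of any of the models listed in the comment. Because the new encoder's weights are not aligned with the pretrained text side, the combined model would need re-pretraining or substantial finetuning before it is useful.

import torch
import torch.nn as nn
from transformers import CvtModel  # assumes the Hugging Face transformers library


class CvtImageEncoder(nn.Module):
    """Wrap CvT so it yields a (batch, tokens, hidden) sequence like a ViT encoder."""

    def __init__(self, target_dim: int = 768):  # 768 is an assumed VLM hidden size
        super().__init__()
        self.backbone = CvtModel.from_pretrained("microsoft/cvt-13")
        cvt_dim = self.backbone.config.embed_dim[-1]  # 384 for CvT-13's last stage
        self.proj = nn.Linear(cvt_dim, target_dim)    # map CvT features to the VLM's width

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        out = self.backbone(pixel_values=pixel_values)
        # CvT's last_hidden_state is a (B, C, H, W) feature map; flatten the
        # spatial grid into a token sequence the language side can attend over.
        tokens = out.last_hidden_state.flatten(2).transpose(1, 2)  # (B, H*W, C)
        return self.proj(tokens)


# Hypothetical usage: replace a VLM's vision tower with the wrapper, then finetune.
# encoder = CvtImageEncoder(target_dim=vlm.config.vision_config.hidden_size)
# vlm.vision_model = encoder  # the attribute name varies by model; check its source

The architectural swap itself is straightforward; the practical cost is that every downstream task then depends on how well the projected CvT features can be re-aligned with the frozen or finetuned language components, which is exactly the efficiency question the commenter's PhD topic raises.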
