Why Vision Language Models Ignore What They See with Munawar Hayat
EPISODE 758 | DECEMBER 9, 2025
About this Episode
In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment.
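The episode only summarizes generalized contrastive learning at a high level; the paper's actual formulation is not detailed here. As a generic illustration of the underlying idea, a standard InfoNCE-style contrastive loss over composed (text + image) query embeddings might look like the following sketch. The function names, the simple averaging fusion, and all parameters are illustrative assumptions, not the method from the paper:

```python
import numpy as np

def info_nce(query_emb, target_emb, temperature=0.07):
    """Generic InfoNCE-style contrastive loss over a batch.

    query_emb, target_emb: (batch, dim) arrays; row i of each forms a
    matched (query, target) pair, and all other rows serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = q @ t.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct target for query i is target i, i.e. the diagonal
    return -np.mean(np.diag(log_probs))

# Illustrative composed query: fuse text and image embeddings by averaging
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
image_emb = rng.normal(size=(4, 8))
composed_query = (text_emb + image_emb) / 2
targets = composed_query + 0.01 * rng.normal(size=(4, 8))  # near-matches
loss = info_nce(composed_query, targets)
```

Because each composed query here is nearly identical to its matched target, the loss comes out close to zero; mismatched pairs would drive it up.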
About the Guest
Munawar Hayat
Qualcomm AI Research
Resources
- Attention Guided Alignment in Vision Language Models
- Generalized Contrastive Learning (GCL): Better Search Across Text and Images
- MultiHuman Testbench: Raising the Bar for Multi Person Image Generation
- Qualcomm at NeurIPS 2025: Pushing the boundaries of AI research
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
- CoVR: Learning Composed Video Retrieval from Web Video Captions
- AI2D-RST: A multimodal corpus of 1000 primary school science diagrams
- CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
- SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
- KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
- OmniDraft: Cross Vocabulary Online Adaptive Drafter for On-device Speculative Decoding
- Neodragon: Mobile Video Generation using Diffusion Transformer
- Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
- Introducing Nano Banana Pro
- High-Efficiency Diffusion Models for On-Device Image Generation and Editing with Hung Bui - #753
- AI at the Edge: Qualcomm AI Research at NeurIPS 2024 with Arash Behboodi - #711
- Inside Nano Banana 🍌 and the Future of Vision-Language Models with Oliver Wang - #748
