Why Vision Language Models Ignore What They See with Munawar Hayat
EPISODE 758 | DECEMBER 9, 2025
About this Episode
In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment.
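The episode only summarizes generalized contrastive learning at a high level; the paper's actual formulation is not detailed here. As a generic illustration of the underlying idea, a standard InfoNCE-style contrastive loss over composed (text + image) query embeddings might look like the following sketch. The function names, the simple averaging fusion, and all parameters are illustrative assumptions, not the method from the paper:

```python
import numpy as np

def info_nce(query_emb, target_emb, temperature=0.07):
    """Generic InfoNCE-style contrastive loss over a batch.

    query_emb, target_emb: (batch, dim) arrays; row i of each forms a
    matched (query, target) pair, and all other rows serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = q @ t.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct target for query i is target i, i.e. the diagonal
    return -np.mean(np.diag(log_probs))

# Illustrative composed query: fuse text and image embeddings by averaging
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
image_emb = rng.normal(size=(4, 8))
composed_query = (text_emb + image_emb) / 2
targets = composed_query + 0.01 * rng.normal(size=(4, 8))  # near-matches
loss = info_nce(composed_query, targets)
```

Because each composed query here is nearly identical to its matched target, the loss comes out close to zero; mismatched pairs would drive it up.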
About the Guest
Munawar Hayat
Qualcomm AI Research
Resources
- Attention Guided Alignment in Vision Language Models
- Generalized Contrastive Learning (GCL): Better Search Across Text and Images
- MultiHuman Testbench: Raising the Bar for Multi Person Image Generation
- Qualcomm at NeurIPS 2025: Pushing the boundaries of AI research
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
- CoVR: Learning Composed Video Retrieval from Web Video Captions
- AI2D-RST: A multimodal corpus of 1000 primary school science diagrams
- CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
- SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
- KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
- OmniDraft: Cross Vocabulary Online Adaptive Drafter for On-device Speculative Decoding
- Neodragon: Mobile Video Generation using Diffusion Transformer
- Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
- Introducing Nano Banana Pro
- High-Efficiency Diffusion Models for On-Device Image Generation and Editing with Hung Bui - #753
- AI at the Edge: Qualcomm AI Research at NeurIPS 2024 with Arash Behboodi - #711
- Inside Nano Banana 🍌 and the Future of Vision-Language Models with Oliver Wang - #748
