Distilling Transformers and Diffusion Models for Robust Edge Use Cases with Fatih Porikli
EPISODE 738 | JULY 9, 2025
About this Episode
Today, we're joined by Fatih Porikli, senior director of technology at Qualcomm AI Research, for an in-depth look at several of Qualcomm's accepted papers and demos featured at this year’s CVPR conference. We start with “DiMA: Distilling Multi-modal Large Language Models for Autonomous Driving,” an end-to-end autonomous driving system that distills large language models for structured scene understanding and safe motion planning in critical "long-tail" scenarios. We explore how DiMA combines LLMs' world knowledge with efficient transformer-based models to significantly reduce collision rates and trajectory errors. We then discuss “SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation,” a diffusion-distilled approach that combines generative models with metric depth estimation to produce sharp, accurate monocular depth maps. Fatih also shares a look at Qualcomm’s on-device demos, including text-to-3D mesh generation, real-time image-to-video and video-to-video generation, and a multi-modal visual question-answering assistant.
About the Guest
Fatih Porikli
Qualcomm
Resources
- AI and computer vision insights at CVPR 2025
- DiMA: Distilling Multi-modal Large Language Models for Autonomous Driving (arXiv:2501.09757)
- SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation
- Workshop on Efficient Large Vision Models
- Workshop on Vision-based Assistants in the Real World
- Multi-modal visual Q&A assistant
- Video-to-video generation on mobile
- VGGT: Visual Geometry Grounded Transformer
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving
- Planning-oriented Autonomous Driving
- MVDream: Multi-view Diffusion for 3D Generation
- Google DeepMind Veo model
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- Gen AI at the Edge: Qualcomm AI Research at CVPR 2024 with Fatih Porikli - #688
- Data Augmentation and Optimized Architectures for Computer Vision with Fatih Porikli - #635
- Optical Flow Estimation, Panoptic Segmentation, and Vision Transformers with Fatih Porikli - #579
- Speculative Decoding and Efficient LLM Inference with Chris Lott - #717


