Recommended by AI Research Weekly

Just read the paper 'Do Vision-Language Models Truly Perform Vision Reasoning?'

This rigorous study reveals a modality gap in VLMs: are we overestimating their capabilities?

The paper 'Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap' systematically investigates whether current vision-language models perform genuine visual reasoning or instead rely on language priors. The researchers found significant modality gaps in popular VLMs, suggesting these models often bypass visual processing entirely. This raises questions about benchmark validity and whether we need new evaluation methods that genuinely test visual understanding.
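One common way to probe this kind of language-prior reliance (a general technique, not necessarily the paper's exact protocol) is a text-only ablation: ask the model the same questions with and without the image and compare accuracy. Here is a minimal sketch; `query_vlm` is a hypothetical stub standing in for a real model call so the script runs end to end.

```python
# Sketch of a text-only ablation for measuring a modality gap.
# `query_vlm` is a hypothetical stub, not a real VLM API: it mimics a
# model that answers common-sense questions from language priors alone.

def query_vlm(question, image=None):
    # A prior-reliant model gives the same answer with or without the image.
    priors = {
        "What color is the sky?": "blue",
        "What color is the banana?": "yellow",
    }
    return priors.get(question, "unknown")

def modality_gap(eval_set):
    """Accuracy with images minus accuracy without them.

    A gap near zero suggests the visual input is being bypassed;
    a large positive gap suggests the model genuinely uses the image.
    """
    with_img = sum(query_vlm(q, img) == ans for q, img, ans in eval_set)
    text_only = sum(query_vlm(q, None) == ans for q, _, ans in eval_set)
    return (with_img - text_only) / len(eval_set)

# Toy eval set: (question, image placeholder, ground-truth answer).
eval_set = [
    ("What color is the sky?", "sky.png", "blue"),
    ("What color is the banana?", "banana.png", "yellow"),
]

print(modality_gap(eval_set))  # 0.0: removing the image changes nothing
```

With a real model, a near-zero gap on a supposedly vision-dependent benchmark is exactly the kind of evidence the paper's title questions.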