Definition

VLM

A Vision-Language Model (VLM) is a multimodal AI model trained on both images and text, enabling it to answer questions about visual content, describe images, or extract structured data from documents.

Frequently Asked Questions

How do VLMs connect images and text?

By processing the image through a vision encoder (such as a CNN or ViT) and passing the resulting features through a projection layer that maps them into the text embedding space of the LLM.
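The projection step can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the toy dimensions, the random stand-in features, and the hypothetical helper names `project` and `build_llm_input`. Real VLMs use learned weights, far larger dimensions, and sometimes an MLP or cross-attention module instead of a single linear layer.

```python
import random

def project(patch_features, weight, bias):
    """Linear projection: map each vision feature vector to the text embedding width.

    weight holds one row per output dimension, each row as long as a vision feature.
    """
    return [
        [sum(f * w for f, w in zip(feat, row)) + b for row, b in zip(weight, bias)]
        for feat in patch_features
    ]

def build_llm_input(patch_features, text_embeddings, weight, bias):
    """Prepend the projected image tokens to the text token embeddings."""
    image_tokens = project(patch_features, weight, bias)
    return image_tokens + text_embeddings

random.seed(0)
VISION_DIM, TEXT_DIM = 4, 3          # toy sizes (assumed)
NUM_PATCHES, NUM_TEXT_TOKENS = 2, 5  # toy sizes (assumed)

# Stand-ins for vision encoder output and LLM token embeddings.
patch_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[random.random() for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)]

# Randomly initialised projection; in practice these weights are trained.
weight = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM

sequence = build_llm_input(patch_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 7 3
```

After projection, image patches and text tokens share the same vector width, so the language model can attend over them as one sequence.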

Give an example of a VLM task.

Asking a model to write code for a website based on an uploaded screenshot mockup.

Quick Facts

Category: Neural Architectures
Key Application: Visual question answering, invoice document processing, and image caption generation

Related AI Terms

Multimodal AI, Vision Transformer

VLM Media Coverage & Intelligence

arXiv AI, Jun 18, 2026

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in com


arXiv AI, Jun 18, 2026

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmente
