Ch 11: Multimodal Large Language Models

Foundations

Core concepts: what multimodal means, how architectures bridge modalities, and why alignment is the key to useful systems.

What Is a Multimodal LLM?

A multimodal LLM processes and reasons over more than one input or output modality — such as text plus images or text plus audio. The language model stays central, but modality-specific encoders and adapters convert non-text inputs into representations the LLM can use.

💡 Think of the LLM as a brain that only speaks one language. Multimodal adapters are translators that convert what eyes and ears perceive into that language so the brain can reason about it.

Pipeline: Image (raw pixels) → Vision Encoder (e.g. ViT, SigLIP) → Adapter / Projector (align spaces) → Language Model (reasoning) → Response (grounded text)

Text Prompt → feeds directly into the language model alongside image-derived tokens

More Than Adding Images

A common misconception is that multimodality simply means "the model can see pictures." In reality, multimodality is about aligning representations across modalities so the system can answer grounded questions rather than hallucinating from text priors alone. The model does not natively read pixels or hear audio — other models translate those signals into a form the LLM can reason over.

What Makes It Work

Three components must cooperate for a multimodal LLM to function:

Modality encoder: A specialized model (e.g., a Vision Transformer) that converts raw input into dense embeddings.
Alignment layer: A projector or adapter that maps encoder embeddings into the language model's token space. Without this bridge, the two models speak different "languages."
Language model: The reasoning engine that conditions on both text tokens and modality-derived representations to generate responses.

The quality of the alignment layer is often the bottleneck. A strong vision encoder paired with a weak adapter produces a system that can describe images in generic terms but cannot answer specific visual questions. See Topic 2: Text-Image Architecture for the detailed pattern.

The Alignment Spectrum

Alignment Quality	Behavior	Example
None	Model ignores image, answers from text priors	Generic captions regardless of image
Weak	Model gets general category right but misses details	"A chart" instead of reading the chart
Strong	Model references specific visual evidence	Reading exact numbers from a bar chart

→ Multimodal LLMs succeed only when representation alignment and reasoning both work — perception mistakes propagate into every downstream language answer.

Python Example — Using a Multimodal API

import base64, httpx
from openai import OpenAI

# Initialize the client for a multimodal model
client = OpenAI()

# Read and encode an image as base64
image_data = base64.b64encode(
    httpx.get("https://example.com/chart.png").content
).decode("utf-8")

# Send both text and image to the model
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What trend does this chart show?"},
            {"type": "image_url",
             "image_url": {
                 "url": f"data:image/png;base64,{image_data}"
             }}
        ]
    }]
)

print(response.choices[0].message.content)

Follow-up Questions

Can a multimodal model generate images, or only consume them?

Some models can both consume and generate images (e.g., GPT-4o with DALL-E integration, Gemini with native image generation), but the architectures differ significantly. Consuming images requires an encoder and adapter. Generating images requires a decoder (typically a diffusion model). Most current systems specialize in one direction or chain separate models.

How does the number of image tokens affect cost and latency?

Each image is converted into a number of tokens that depends on its resolution and detail setting (e.g., roughly 85–1700 tokens in GPT-4o). These tokens consume context window space and are billed like text tokens. Higher-resolution settings use more tokens, increasing both cost and processing time. Choose the minimum resolution that preserves task-critical detail.

What is the difference between early fusion and late fusion?

Early fusion mixes modality representations at the input level, letting the model attend across modalities from the first layer. Late fusion processes each modality independently and combines them only at decision time. Early fusion is more expressive but requires more compute; most modern multimodal LLMs use a form of early fusion via adapter-injected tokens.
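The contrast between the two fusion styles can be sketched with a toy example. This is a minimal NumPy illustration, not any specific model's architecture: the attention function uses identity projections, and the token counts and embedding width are arbitrary assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # shared embedding width (illustrative)
image_tokens = rng.normal(size=(4, d))   # 4 image-derived tokens
text_tokens = rng.normal(size=(3, d))    # 3 text tokens


def self_attention(x):
    """Toy single-head self-attention with identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Early fusion: one joint sequence, so every token can attend across modalities.
joint = np.concatenate([image_tokens, text_tokens], axis=0)
early_out, early_weights = self_attention(joint)

# Late fusion: each modality is processed alone; only pooled summaries meet.
image_out, _ = self_attention(image_tokens)
text_out, _ = self_attention(text_tokens)
late_out = np.concatenate([image_out.mean(axis=0), text_out.mean(axis=0)])

# Attention mass that the text tokens (rows 4..6) place on image tokens (cols 0..3).
cross_modal_mass = early_weights[4:, :4].sum(axis=-1)
print(early_out.shape, late_out.shape, cross_modal_mass.round(2))
```

In the early-fusion path, the text rows of the attention matrix put non-zero weight on image columns, which is the cross-modal interaction that late fusion never gets: there, each modality only sees itself until the pooled vectors are joined.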

Text-Image Architecture Pattern

The dominant pattern for text-image systems is: vision encoder converts images to embeddings, a projection layer aligns those embeddings to the language model's space, and the LLM reasons over both text tokens and image-derived tokens to produce a response.

💡 The vision encoder is a camera, the projector is a simultaneous interpreter, and the LLM is the analyst who only reads reports in one language.

Image Input

Raw pixels (e.g., a 224x224 or 336x336 image), possibly at multiple resolutions.

↓

Vision Encoder (ViT / SigLIP / CLIP)

Splits image into patches, produces a sequence of dense embedding vectors. Often pre-trained and frozen.

↓

Adapter / Projector (MLP, Q-Former, Perceiver)

Maps vision embeddings into the LLM's token embedding space. This is the alignment bottleneck — trained specifically for the LLM.

↓

Language Model

Receives interleaved text tokens and image tokens. Self-attention runs across all tokens, enabling cross-modal reasoning.

The Bridge Is Everything

The language model is not natively reading pixels. Another model turns pixels into a form the LLM can reason over. The quality and design of this bridge — the adapter/projector — determines how much visual detail survives into the reasoning stage.

Common Adapter Architectures

Adapter Type	How It Works	Used In
Linear projection	Simple learned linear map from vision to text space	LLaVA v1
MLP projector	Two-layer MLP with nonlinearity for richer mapping	LLaVA v1.5+
Q-Former	Learned queries attend to vision features via cross-attention	BLIP-2, InstructBLIP
Perceiver resampler	Fixed number of latent queries compress variable-length vision outputs	Flamingo, Qwen-VL
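The query-based rows of the table share one idea: a fixed set of learned queries cross-attends to however many vision features arrive and returns a fixed number of outputs. The sketch below shows only that core step in NumPy. It is a simplification, not the BLIP-2 or Flamingo implementation (those add learned key/value projections, multiple layers, and more); the function name `resample` and the random stand-in weights are assumptions for illustration.

```python
import numpy as np


def resample(vision_feats, queries):
    """Cross-attention: each query takes a softmax-weighted average of patches."""
    d = queries.shape[-1]
    scores = queries @ vision_feats.T / np.sqrt(d)    # [num_queries, num_patches]
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ vision_feats                      # [num_queries, d]


rng = np.random.default_rng(0)
d = 16
queries = rng.normal(size=(32, d))     # stand-in for trained query vectors

low_res = rng.normal(size=(256, d))    # e.g. a 16x16 patch grid
high_res = rng.normal(size=(576, d))   # e.g. a 24x24 patch grid

print(resample(low_res, queries).shape)    # (32, 16)
print(resample(high_res, queries).shape)   # (32, 16)
```

The output length is set by the number of queries, not by the image resolution, which is why these adapters keep the LLM's token budget constant at the possible cost of spatial detail.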

Training Strategy

Most text-image systems train in two phases:

Pre-training alignment: Train the projector on large-scale image-caption pairs while keeping both the vision encoder and LLM frozen. This teaches the projector to map visual features into the language space.
Instruction tuning: Fine-tune the projector (and often some or all of the LLM) on visual question-answering and instruction-following data. This teaches the system to respond to complex visual queries.

See Topic 4: Visual Grounding for why the second phase is critical for producing answers tied to actual image evidence.

→ The adapter between vision encoder and LLM is the alignment bottleneck — it determines how much visual detail the language model can actually reason about.

Python Example — LLaVA-style Forward Pass (Pseudocode)

import torch
from transformers import AutoModelForCausalLM, CLIPVisionModel

# Simplified sketch of a LLaVA-style multimodal forward pass
class MultimodalLLM(torch.nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",
                 llm_name="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        # Vision encoder: pre-trained ViT, usually frozen
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_name)
        self.vision_encoder.requires_grad_(False)
        # Language model: pre-trained, may be partially fine-tuned
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        vision_dim = self.vision_encoder.config.hidden_size  # 1024 for CLIP ViT-L/14
        llm_dim = self.llm.config.hidden_size                # 4096 for Llama 3 8B
        # Projector: maps vision dims -> LLM dims
        self.projector = torch.nn.Sequential(
            torch.nn.Linear(vision_dim, llm_dim),
            torch.nn.GELU(),
            torch.nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, text_ids):
        # Step 1: Encode image into patch embeddings, dropping the leading CLS token
        encoded = self.vision_encoder(pixel_values=pixel_values)
        vision_feats = encoded.last_hidden_state[:, 1:]  # [B, num_patches, vision_dim]
        # Step 2: Project into LLM space
        image_tokens = self.projector(vision_feats)      # [B, num_patches, llm_dim]
        # Step 3: Get text embeddings
        text_embeds = self.llm.get_input_embeddings()(text_ids)
        # Step 4: Concatenate image + text tokens
        combined = torch.cat([image_tokens.to(text_embeds.dtype), text_embeds], dim=1)
        # Step 5: LLM reasons over combined sequence
        return self.llm(inputs_embeds=combined)

Follow-up Questions

Should the vision encoder be frozen or fine-tuned?

Freezing the vision encoder preserves its strong pre-trained features and reduces training cost. Fine-tuning the encoder can improve performance on domain-specific tasks (e.g., medical imaging) but risks catastrophic forgetting of general capabilities. Most systems freeze the encoder for pre-training alignment and optionally unfreeze it during instruction tuning.

How does image resolution affect the number of tokens?

Higher-resolution images produce more patches from the vision encoder, which means more tokens entering the LLM. For example, a 224x224 image split into 14x14-pixel patches produces 16 × 16 = 256 tokens, while a 336x336 image produces 24 × 24 = 576. Some systems use dynamic resolution, tiling large images into multiple crops to preserve detail at the cost of more tokens.
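The patch arithmetic is easy to check in a few lines. The first function below reproduces the square-image counts; the second is a hypothetical tiling scheme (tile grid plus one downscaled thumbnail) meant only to show how dynamic resolution multiplies the count. Real systems differ in how they choose tiles, so treat `num_tiled_tokens` and its parameters as assumptions.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Tokens from a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def num_tiled_tokens(width, height, tile_size, patch_size, add_thumbnail=True):
    """Hypothetical tiling: cover the image with square tiles, plus one thumbnail."""
    tiles_x = -(-width // tile_size)    # ceiling division
    tiles_y = -(-height // tile_size)
    tiles = tiles_x * tiles_y + (1 if add_thumbnail else 0)
    return tiles * num_patch_tokens(tile_size, patch_size)


print(num_patch_tokens(224, 14))   # 256
print(num_patch_tokens(336, 14))   # 576
print(num_tiled_tokens(1344, 672, tile_size=336, patch_size=14))   # 5184
```

Under these assumptions a 1344x672 image costs nine times the tokens of a single 336x336 crop, which is the trade-off to weigh against the detail gained.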

What is the role of the Q-Former in BLIP-2?

The Q-Former uses a fixed set of learned query tokens that attend to the vision encoder's output via cross-attention. This compresses the variable-length vision output into a fixed number of tokens (typically 32), reducing the load on the LLM. The trade-off is that compression can lose spatial detail, which is why systems like LLaVA pass all patch tokens directly.

CLIP and Contrastive Alignment

CLIP showed that image and text representations can be aligned through contrastive learning on large-scale paired data. It made zero-shot vision tasks practical and established the idea that natural language can serve as a supervision signal for perception models.

💡 CLIP is like teaching two people who speak different languages to point at the same objects. After enough practice, they agree on what goes together even for things they have never seen before.

Figure: Contrastive embedding space. Matched pairs pull together, mismatched pairs push apart. Legend: image embeddings, text embeddings, matched pair.

How Contrastive Learning Works

CLIP trains an image encoder and a text encoder simultaneously on ~400 million image-text pairs scraped from the internet. For each batch of N pairs:

Encode all N images and all N texts into the same embedding space.
Compute cosine similarity between every image-text combination (an N×N matrix).
Maximize similarity for the N correct pairs (the diagonal) and minimize similarity for the N²−N incorrect combinations.

The result is a shared embedding space where semantically related images and texts cluster together, regardless of the specific words or visual appearance used.
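The three steps above can be sketched as a symmetric cross-entropy over the N×N similarity matrix. This is a minimal NumPy version for illustration, not CLIP's training code: the random embeddings stand in for encoder outputs, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalise so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature     # N x N similarity matrix
    diag = np.arange(logits.shape[0])
    # Image -> text: each row should pick its own column (the diagonal)
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: each column should pick its own row
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
text = rng.normal(size=(4, 32))                  # stand-in text embeddings
aligned_images = text + 0.01 * rng.normal(size=(4, 32))

aligned = clip_style_loss(aligned_images, text)
shuffled = clip_style_loss(aligned_images[::-1], text)   # every pair mismatched
print(aligned < shuffled)
```

The loss is low when the diagonal dominates each row and column, and high when the best match for each image sits off the diagonal.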

Why CLIP Matters

Before CLIP, vision models were trained on fixed label sets (e.g., ImageNet's 1,000 classes). CLIP demonstrated three breakthroughs:

Zero-shot transfer: Classify any image by comparing it to text descriptions of the candidate classes — no retraining needed.
Language as supervision: Natural language captions provide richer, more flexible supervision than category labels.
Shared representation space: The aligned space supports image retrieval, visual question answering, and multimodal generation (see Topic 1: What Is a Multimodal LLM?).

Limitations

Limitation	Consequence
Bag-of-concepts bias	CLIP captures what is in an image but struggles with spatial relationships ("left of," "on top of")
Training data biases	Internet-scraped pairs contain cultural and demographic biases that transfer into the embedding space
Fine-grained distinction	Distinguishing similar species, medical conditions, or technical diagrams may require domain-specific fine-tuning

→ CLIP proved that aligned representation spaces can support flexible multimodal reasoning and retrieval — replacing narrow fixed-label vision systems with open-vocabulary understanding.

Python Example — CLIP Zero-Shot Classification

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load pre-trained CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load an image
image = Image.open("photo.jpg")

# Define candidate classes as natural language
labels = ["a photo of a cat", "a photo of a dog",
          "a photo of a car", "a photo of a building"]

# Encode image and text into the shared space
inputs = processor(text=labels, images=image,
                   return_tensors="pt", padding=True)
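with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Scaled cosine similarities -> probabilities over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
for label, prob in zip(labels, probs[0]):
    print(f"  {label}: {prob.item():.1%}")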

Follow-up Questions

What is SigLIP and how does it improve on CLIP?

SigLIP replaces CLIP's softmax-based contrastive loss with a sigmoid loss applied to each image-text pair independently. This reduces the dependence on very large batch sizes (which CLIP relies on for negatives) and often achieves better performance at a given batch size. SigLIP-trained encoders serve as the vision encoder in models such as PaLI-3 and PaliGemma.
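A pairwise sigmoid loss of this kind can be sketched as follows. Each of the N² image-text combinations becomes its own binary decision (match or not), so no softmax normalisation across the batch is needed. This is a simplified NumPy illustration rather than the reference implementation; the scale `t` and bias `b` are learnable in SigLIP, and the fixed values here are illustrative assumptions.

```python
import numpy as np


def sigmoid_pair_loss(image_emb, text_emb, t=10.0, b=-10.0):
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = t * image_emb @ text_emb.T + b
    n = logits.shape[0]
    labels = 2 * np.eye(n) - 1          # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n


rng = np.random.default_rng(0)
text = rng.normal(size=(4, 32))                  # stand-in text embeddings
images = text + 0.01 * rng.normal(size=(4, 32))  # stand-in matching image embeddings

aligned = sigmoid_pair_loss(images, text)
shuffled = sigmoid_pair_loss(images[::-1], text)  # every pair mismatched
print(aligned < shuffled)
```

Because each entry of `per_pair` depends only on its own image-text pair, the loss for one pair does not change when other examples are added to or removed from the batch.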

Can CLIP embeddings be used for retrieval directly?

Yes. CLIP embeddings are widely used for cross-modal retrieval: given a text query, find the most similar images (or vice versa) by nearest-neighbor search in the shared embedding space. This powers image search engines, content moderation, and multimodal RAG systems.
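Nearest-neighbor search over a shared embedding space reduces to cosine similarity plus a sort. The sketch below uses random vectors as stand-ins for precomputed image embeddings and fakes a text query by perturbing one of them; in practice both would come from the CLIP encoders, and a large index would use an approximate nearest-neighbor library rather than a full sort.

```python
import numpy as np


def top_k(query_emb, index_emb, k=3):
    """Return (index, cosine similarity) for the k most similar rows."""
    query = query_emb / np.linalg.norm(query_emb)
    index = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = index @ query
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


rng = np.random.default_rng(0)
image_index = rng.normal(size=(100, 64))   # stand-in for stored image embeddings
# Pretend text-query embedding that lands close to image 42
text_query = image_index[42] + 0.1 * rng.normal(size=64)

results = top_k(text_query, image_index)
print(results[0][0])   # 42
```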

How does contrastive pre-training differ from generative pre-training?

Contrastive pre-training learns to distinguish matching pairs from non-matching pairs, producing good embeddings for retrieval and classification. Generative pre-training (like GPT) learns to predict the next token, producing good models for text generation. Modern multimodal systems often combine both: contrastive alignment for the encoder, generative training for the LLM.

Visual Grounding

Visual grounding means the model's language output is actually tied to the image evidence rather than being generated from language priors or stereotyped assumptions. A grounded answer about a chart should reflect what is present in the image, not what is merely plausible in general.

💡 Grounding is like the difference between a witness testimony (based on observation) and a guess (based on common sense). Only the testimony counts as evidence.

Same question about a chart image — grounded vs. ungrounded responses

Ungrounded

Answers from text priors

"The chart shows that revenue has been growing steadily over the past year, which is typical for a company in this sector."

Problem: sounds plausible but is not based on the actual chart data

Grounded

Answers from image evidence

"The bar chart shows Q1 at $2.3M, Q2 at $1.8M (a 22% decline), Q3 at $2.1M, and Q4 at $3.1M. Revenue dipped in Q2 before recovering."

Strength: references specific values visible in the chart

The Core Trust Problem

Grounding is the core trust problem in multimodal AI. A language model that has read millions of financial reports can generate plausible-sounding chart descriptions without looking at the chart at all. Fluency without grounding produces very convincing errors. Users assume "the model saw the image, so it must know" — making ungrounded answers especially dangerous.

How Grounding Fails

Prior override: The model's text knowledge overrides weak visual signal. A chart labeled "Revenue Growth" may trigger boilerplate about growth even if the chart shows decline.
Hallucinated details: The model invents specific numbers, labels, or objects not present in the image.
Stereotype completion: Given a photo of a kitchen, the model describes items it expects to see (microwave, toaster) rather than what is actually visible.

Testing for Grounding

To verify grounding, use adversarial or counter-intuitive images: a chart that shows decline when the title says "growth," a photo with unusual objects, or a document with intentional errors. If the model's answer matches the image rather than common expectations, it is grounded. See Topic 7: Evaluating Multimodal Systems for systematic evaluation approaches.
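A counter-intuitive test case can be encoded as two term lists: what only the image supports, and what the misleading prior suggests. The sketch below is a keyword heuristic with hypothetical names (`GroundingCase`, `classify_answer`); it is meant to triage answers for review, not to replace human judgment.

```python
from dataclasses import dataclass


@dataclass
class GroundingCase:
    name: str
    image_terms: list   # terms supported only by the image content
    prior_terms: list   # terms the misleading title or scene prior suggests


def classify_answer(answer: str, case: GroundingCase) -> str:
    text = answer.lower()
    saw_image = any(t in text for t in case.image_terms)
    echoed_prior = any(t in text for t in case.prior_terms)
    if saw_image and not echoed_prior:
        return "grounded"
    if echoed_prior and not saw_image:
        return "ungrounded"
    return "needs_review"   # mixed or empty signal: send to a human


case = GroundingCase(
    name="growth-title-decline-bars",
    image_terms=["decline", "decrease", "drop", "fell"],
    prior_terms=["growth", "increase", "rising"],
)

print(classify_answer("Revenue fell from Q1 to Q4.", case))       # grounded
print(classify_answer("The chart shows steady growth.", case))    # ungrounded
print(classify_answer("Titled 'Growth', but the bars show a decline.", case))  # needs_review
```

The third answer is arguably the best one, yet a keyword check cannot tell it apart from a confused answer, which is why mixed cases are routed to review instead of being scored automatically.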

→ Grounding is what separates a multimodal model that observes from one that merely guesses — and users cannot easily tell the difference without deliberate testing.

Python Example — Grounding Test with Counter-Intuitive Image

import base64
from openai import OpenAI

client = OpenAI()

# The API needs an http(s) URL or a base64 data URL, not a file:// path,
# so encode the local test image inline
with open("misleading_chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Test grounding: the chart title says "Growth"
# but the actual bars show a 40% decline
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe the trend shown in this chart. "
                      "Report the actual values you see."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }]
)

answer = response.choices[0].message.content
# GROUNDED: mentions the decline despite the misleading title
# UNGROUNDED: echoes "growth" from the title
# Crude keyword check; review borderline answers by hand
print("Grounded?", "decline" in answer.lower())

Follow-up Questions

How can you encourage grounding in prompts?

Ask the model to "describe only what you see", "quote the exact text visible", or "if you cannot determine this from the image, say so." Explicit instructions to reference visual evidence and refuse when uncertain significantly improve grounding behavior compared to open-ended questions.

Is grounding better in newer models?

Yes, generally. Models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro show significantly better grounding than earlier systems. However, no current model is perfectly grounded. All still exhibit hallucination on challenging images, especially small text, dense charts, and spatially complex scenes. Verification remains essential.

What is the relationship between grounding and hallucination?

Hallucination is the symptom; lack of grounding is the cause. When the model fails to ground its output in the visual input, it falls back on text priors, which generates plausible but unverified content. Reducing hallucination in multimodal systems requires improving grounding, not just penalizing "wrong" outputs.

OCR vs Vision-Language Understanding

OCR excels when the image is primarily text (documents, forms, receipts). Native vision-language models are more valuable when spatial layout, objects, relationships, and mixed visual-text cues matter together. Many strong systems combine both rather than treating them as mutually exclusive.

💡 OCR reads the letters on a page. A vision-language model reads the page and understands the diagram next to the letters.

OCR Best For

✓ Typed documents and forms
✓ Receipts and invoices
✓ Screenshots with structured text
✓ Scanned pages with uniform layout
✗ Photos with mixed text and objects
✗ Charts requiring spatial reasoning

Vision-Language Best For

✓ Charts, graphs, and diagrams
✓ Photos with spatial relationships
✓ Mixed media documents
✓ Handwritten + visual content
✗ Dense text extraction at scale
✗ Exact character-level fidelity

Choosing the Right Tool

The practical answer is to choose the tool that best matches the information source. If you need to extract structured text from thousands of scanned invoices, dedicated OCR pipelines (Tesseract, Azure Document Intelligence, Google Document AI) will be faster, cheaper, and more accurate than sending each page to a multimodal LLM.

But if you need to understand a complex infographic that mixes charts, icons, and text annotations — and answer questions about it — a vision-language model is far more capable because it can reason about the spatial relationships between visual elements.

The Hybrid Approach

Many production systems combine both:

OCR first: Extract structured text from the document.
Vision-language second: Send the image plus extracted text to the multimodal model, giving it both visual context and reliable text extraction.

This hybrid approach addresses a key weakness of vision-language models: they can misread small text, confuse similar characters (0/O, 1/l), or skip text in crowded layouts. The OCR layer provides a reliable text backbone while the vision model handles layout and visual reasoning.
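One inexpensive way to use the OCR layer as a backbone is to cross-check every number in the model's answer against the OCR text and flag the ones OCR never saw. The sketch below assumes both outputs are already available as strings; the helper names and the sample invoice text are illustrative, not taken from a real document.

```python
import re

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set:
    """Collect numeric strings, ignoring thousands separators."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(model_answer: str, ocr_text: str) -> set:
    """Numbers the model reported that do not appear in the OCR output."""
    return extract_numbers(model_answer) - extract_numbers(ocr_text)


ocr_text = "Invoice total: 2,341.00 due 2024-07-15"
answer = "The total amount is 2,841.00 and it is due on 2024-07-15."

print(unsupported_numbers(answer, ocr_text))   # {'2841.00'}
```

A non-empty result does not prove the model is wrong, since OCR can also misread digits, but it marks exactly the values that deserve a second look.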

When OCR Alone Is Not Enough

Scenario	Why OCR Falls Short	What VL Models Add
Chart comprehension	OCR extracts axis labels but not visual trends	Understands bar heights, line slopes, relative sizes
Form with checkboxes	OCR reads text but misses checked/unchecked state	Perceives checkbox state as visual signal
Handwritten notes	OCR accuracy drops significantly	Handles messy handwriting with contextual inference

→ OCR and vision-language models are complementary — combine them when the task requires both precise text extraction and visual-spatial reasoning.

Python Example — Hybrid OCR + Vision-Language Pipeline

import base64
import pytesseract
from PIL import Image
from openai import OpenAI

# Step 1: Extract text with OCR for reliable text extraction
image = Image.open("invoice.png")
ocr_text = pytesseract.image_to_string(image)

# The API needs an http(s) URL or a base64 data URL, not a file:// path
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Step 2: Send image + OCR text to multimodal model
# The OCR provides reliable text; the model adds layout understanding
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"OCR extracted text:\n{ocr_text}\n\n"
                      "Using the image and extracted text, "
                      "identify the total amount and due date."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }]
)
print(response.choices[0].message.content)

Follow-up Questions

Are multimodal LLMs replacing traditional OCR?

For high-volume, structured document processing, traditional OCR remains faster and cheaper. Multimodal LLMs are replacing OCR in scenarios requiring understanding rather than just extraction — answering questions about documents, summarizing visual content, or handling unstructured layouts. The two are converging but not yet substitutes at scale.

How accurate are multimodal LLMs at reading small text in images?

Accuracy varies significantly with image resolution and text size. Current models handle large, clear text well but struggle with text under ~8px height, low-contrast text, and text in unusual fonts or orientations. For critical text extraction, always verify with dedicated OCR or request high-resolution image inputs.

Practice & Evaluation

How to prompt, evaluate, and debug multimodal systems — plus the modalities beyond static images and the business cases that deliver real value.

Multimodal Prompting

Multimodal prompting must direct the model not only on what to answer but also on what visual evidence to inspect. Good prompts specify the task, the level of detail needed, and whether the model should prioritize text, objects, layout, or anomalies in the image.

💡 A text prompt is like asking a question. A multimodal prompt is like asking a question and pointing at the whiteboard — you must tell the model where to look, not just what to think about.

Weak vs. strong multimodal prompts for the same image

Weak Prompt

What does this image show?

Too vague — model picks whatever it finds easiest to describe

Strong Prompt

This is a bar chart of quarterly revenue. Read the exact value for each quarter from the y-axis. If any value is unclear, say so. Report any trend or anomaly.

Specific task, precision level, and abstention guidance

The Same Principle, Extended

The fundamental prompting principle still applies: clear tasks beat vague requests. But multimodal prompting adds dimensions that text-only prompting does not require:

Perception guidance: Tell the model what type of visual content to focus on (text in image, object positions, colors, chart axes).
Detail calibration: Specify whether you need a high-level summary or pixel-level precision.
Abstention instruction: Explicitly tell the model to say "I cannot determine this" when image quality or content is insufficient. Without this, the model will fabricate details.
Format specification: Request structured output (JSON, table, list) when you need to parse the results programmatically.

Prompting Strategies by Image Type

Image Type	Effective Strategy
Charts/graphs	Ask for specific axis values, trends, and data points by name
Documents	Request structured extraction with field names
Photos/scenes	Specify spatial reasoning ("what is to the left of...")
Screenshots	Ask about UI elements, error messages, or specific regions
Medical/technical	Request domain-specific observations with confidence levels

See Topic 4: Visual Grounding for why abstention instructions are critical for maintaining trust.

→ Multimodal prompts must guide perception, not just reasoning — tell the model what to look at and how precisely, or it will default to superficial description.

Python Example — Structured Multimodal Prompt

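from openai import OpenAI

client = OpenAI()
chart_url = "https://example.com/chart.png"  # placeholder image URL

# A well-structured multimodal prompt for chart analysis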
prompt = """Analyze this bar chart image. Follow these steps:

1. AXES: Read the x-axis labels and y-axis scale.
2. VALUES: Report the exact value for each bar.
   If a value is ambiguous, give a range.
3. TREND: Describe the overall trend in 1 sentence.
4. ANOMALIES: Note any unusual patterns or outliers.

Output as JSON with keys: axes, values, trend, anomalies.
If any part is unreadable, set its value to null."""

# This structured approach gets far better results
# than asking "What does this chart show?"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": chart_url}}
        ]
    }],
    response_format={"type": "json_object"}
)

Follow-up Questions

Does chain-of-thought prompting work for multimodal tasks?

Yes. Asking the model to "first describe what you see, then reason about it" often improves accuracy on complex visual tasks. This two-step approach forces the model to commit to observations before drawing conclusions, reducing the chance that text priors override visual evidence. See Topic 4: Visual Grounding.

How should you handle multiple images in one prompt?

When sending multiple images, label each image explicitly ("Image 1 shows..., Image 2 shows...") and reference them by label in your question. Without labels, the model may confuse which image you are asking about, especially when the images are similar. Most APIs support multiple images in a single message.
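The labeling advice can be applied mechanically when building the request. The sketch below assembles a content list in the same text/image_url part format used in this chapter's API examples, placing an explicit label before each image. The helper name and the example URLs are placeholders.

```python
def build_multi_image_content(question: str, image_urls: list) -> list:
    """Interleave 'Image N:' labels with images, then append the question."""
    content = []
    for i, url in enumerate(image_urls, start=1):
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": question})
    return content


content = build_multi_image_content(
    "Compare Image 1 and Image 2. Which one shows higher Q4 revenue?",
    ["https://example.com/chart_2023.png", "https://example.com/chart_2024.png"],
)
messages = [{"role": "user", "content": content}]

print(len(content))   # 5
```

The resulting `messages` list can be passed to a chat-completions call in place of a single-image message.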

What perception limits should you account for?

Current models struggle with: small text (under 8px), fine-grained counting (exact number of objects in a crowd), spatial precision (exact pixel coordinates), and rotated or skewed content. Design prompts that acknowledge these limits by asking for confidence levels or ranges rather than exact values.

Evaluating Multimodal Systems

Evaluation should measure grounded correctness, not just fluency. Multimodal evaluation usually requires task-specific datasets plus manual review because many failures are subtle and cannot be detected by string matching alone.

💡 Evaluating a multimodal model is like grading an open-book exam where you must also check whether the student actually read the book or just made up convincing answers.

Evaluation dimensions for multimodal systems

Answer Accuracy: Is the factual content correct given the image?
Object/Attribute: Are named objects and their properties correct?
OCR Fidelity: Is text in the image read accurately?
Spatial Reasoning: Are positions and relationships correct?
Refusal Quality: Does it refuse when the image is unclear?
Human Preference: Do users find the response useful?

Why Standard NLP Metrics Fail

Multimodal quality cannot be evaluated with standard NLP metrics like BLEU or ROUGE. A response that uses different words to describe the same visual observation may score poorly on string matching but be perfectly correct. Conversely, a response that copies common patterns may score well on string overlap while being completely ungrounded.
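Both failure directions show up even with a very simple overlap score. The sketch below uses unigram F1 as a simplified stand-in for BLEU/ROUGE-style string matching (it is not their exact definition), with made-up sentences about a quarterly revenue chart.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between a candidate and a reference sentence."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "revenue dipped in q2 before recovering in q4"
correct_paraphrase = "sales fell during the second quarter then rebounded by year end"
wrong_but_similar = "revenue dipped in q4 before recovering in q2"

print(unigram_f1(correct_paraphrase, reference))   # 0.0
print(unigram_f1(wrong_but_similar, reference))    # 1.0
```

The correct paraphrase scores zero because it shares no words with the reference, while the answer that swaps the quarters, and therefore misreads the chart, scores perfectly.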

Evaluation Dimensions

Depending on the task, evaluation should span multiple dimensions:

Answer accuracy: Is the response factually correct given the specific image?
Object/attribute correctness: Are named objects, colors, quantities, and labels correct?
OCR fidelity: When text appears in the image, is it read accurately?
Spatial reasoning: Are positional relationships (left/right, above/below, inside/outside) correct?
Refusal behavior: Does the model appropriately refuse or hedge when the image is blurry, ambiguous, or insufficient?
Human preference: Is the response actually useful to the target user?

Building a Multimodal Eval Suite

Component	Purpose	Example
Task-specific dataset	Test accuracy on your actual use case	100 real customer support screenshots with ground-truth answers
Adversarial images	Test grounding and robustness	Charts with misleading titles, photos with unusual objects
Edge-case gallery	Test failure modes	Blurry images, tiny text, dense layouts
Human review protocol	Catch subtle errors automated metrics miss	Domain experts rating responses on a 1–5 scale

See Topic 8: Common Failure Modes for the specific failures your eval suite should be designed to catch.

→ Multimodal evaluation requires task-specific datasets and human review — fluency metrics will not catch the subtle perception and grounding errors that matter most.

Python Example — Simple Eval Framework

def evaluate_multimodal_response(response, ground_truth):
    """Score a multimodal response across key dimensions."""
    scores = {}
    text = response.lower()

    # 1. Factual accuracy: fraction of key facts that appear in the response
    key_facts = ground_truth.get("facts", [])
    found = sum(1 for f in key_facts if f.lower() in text)
    # Avoid dividing by zero when a case lists no facts
    scores["accuracy"] = found / len(key_facts) if key_facts else None

    # 2. Hallucination check: detect fabricated details
    forbidden = ground_truth.get("absent_objects", [])
    hallucinated = [f for f in forbidden if f.lower() in response.lower()]
    scores["hallucination_count"] = len(hallucinated)

    # 3. Refusal quality: did it refuse when it should have?
    should_refuse = ground_truth.get("should_refuse", False)
    refusal_phrases = ["cannot determine", "unclear", "not visible"]
    did_refuse = any(p in response.lower() for p in refusal_phrases)
    scores["refusal_correct"] = did_refuse == should_refuse

    return scores

Follow-up Questions

What public benchmarks exist for multimodal evaluation?

Key benchmarks include MMMU (multi-discipline understanding), MMBench (broad visual reasoning), VQAv2 (visual question answering), TextVQA (text-in-image reading), and RealWorldQA (real-world spatial reasoning). However, public benchmarks often saturate quickly and may not reflect your specific use case. Custom eval sets are essential.

Can you use an LLM to judge multimodal responses?

Yes, LLM-as-judge approaches work for multimodal evaluation, but with caveats. You can send the image, the model's response, and the ground truth to a stronger model and ask it to rate accuracy. This catches many errors but can miss subtle spatial or OCR mistakes. Combine with human review for high-stakes applications.

How often should you re-evaluate after model updates?

Re-evaluate after every model version change. Provider model updates can change multimodal behavior significantly — improving some capabilities while regressing on others. Maintain a fixed eval set and run it automatically to detect regressions before they reach production.
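An automated regression gate can be as small as a comparison between two score dictionaries produced by the same fixed eval set. The sketch below is one possible shape for it; the metric names, the tolerance, and the scores are illustrative placeholders, not measured results.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return metrics where the candidate is missing or worse than baseline - tolerance."""
    regressions = {}
    for metric, old in baseline.items():
        new = candidate.get(metric)
        if new is None or new < old - tolerance:
            regressions[metric] = (old, new)
    return regressions


# Placeholder scores from the same fixed eval set, before and after a model update
baseline = {"accuracy": 0.91, "ocr_fidelity": 0.88, "refusal_correct": 0.80}
candidate = {"accuracy": 0.93, "ocr_fidelity": 0.81, "refusal_correct": 0.79}

print(find_regressions(baseline, candidate))   # {'ocr_fidelity': (0.88, 0.81)}
```

Here overall accuracy improved while OCR fidelity dropped, the pattern described above: a headline gain can hide a regression on one capability, so each dimension is checked separately.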

Common Failure Modes

Multimodal LLMs commonly hallucinate unseen objects, misread small text, lose spatial relationships, confuse charts, overtrust noisy OCR, and answer beyond what the image actually supports. These failures are especially dangerous because users assume the model saw the image.

💡 Multimodal failures are like an overconfident tour guide who describes landmarks that are not there — the confidence makes the errors more harmful, not less.

Object Hallucination

Describes objects not present in the image based on scene expectations (e.g., "there is a laptop on the desk" when no laptop exists).

Text Misreading

Misreads small or stylized text, confuses similar characters (0/O, 1/l/I), or invents text that is not in the image.

Spatial Confusion

Gets left/right, above/below, or containment relationships wrong, especially in complex layouts.

Chart Misinterpretation

Reads wrong values from axes, confuses chart types, or describes trends opposite to what the data shows.

Overconfident OCR Trust

Trusts noisy or partial OCR output without hedging, propagating extraction errors into analysis.

Distribution Shift

Performance on real-world images is often noticeably worse than on curated benchmarks because of noise, blur, and unusual angles.

Why Multimodal Failures Are Dangerous

A senior answer adds that multimodal failures are especially dangerous because users may assume "the model saw the image, so it must know." In text-only settings, users understand the model is generating from training data. In multimodal settings, the image creates an illusion of observation, making hallucinations more credible and harder to catch.

Failure Categories in Detail

Hallucinating unseen objects: The model describes plausible but absent elements based on scene priors. A kitchen image may trigger descriptions of common appliances regardless of what is actually shown.
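Misreading small text: Characters rendered at less than roughly 8 pixels tall are often misread (a rule of thumb; the threshold varies with the model and its input resolution). Numbers are especially prone to errors (e.g., "2,341" read as "2,841").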
Losing spatial relationships: "The red box is above the blue box" may be reversed. Models encode position weakly compared to object identity.
Confusing charts: Values read from the wrong axis, trends described opposite to the data, or chart types misidentified.
Answering beyond the image: When asked about something not visible, the model fills in from general knowledge rather than refusing.

Mitigation Strategies

Strategy	How It Helps
Abstention prompting	Instruct the model to say "I cannot determine" when uncertain
High-resolution mode	More pixels (and therefore more patches) per character reduce text misreading
Multi-crop processing	Process image regions separately for detail-heavy areas
Verification pipeline	Use a second model or OCR to cross-check critical claims
Human-in-the-loop	Require human review for high-stakes visual decisions

See Topic 4: Visual Grounding for the underlying grounding problem that drives most failures, and Topic 7: Evaluating Multimodal Systems for how to build eval suites that catch them.

→ Multimodal errors are uniquely dangerous because the image creates an illusion of observation — grounding checks and abstention are essential, not optional.

Python Example — Hallucination Detection Check

import base64

from openai import OpenAI

def check_for_hallucination(image_path, primary_response):
    """Cross-check a multimodal response with a second query."""
    client = OpenAI()

    # The API does not fetch file:// paths, so inline the local image
    # as a base64 data URL (JPEG is assumed here).
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Ask a focused verification question about key claims
    verification_prompt = f"""The following response was generated about
an image. Verify whether each claimed object or fact is
actually visible in the image.

Response to verify:
{primary_response}

For each claim, state: CONFIRMED, UNCERTAIN, or NOT_VISIBLE.
Only mark CONFIRMED if you can clearly see the evidence."""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": verification_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
            ]
        }]
    )
    return result.choices[0].message.content
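
The text returned by check_for_hallucination still has to be turned into a decision. The sketch below counts verdict labels line by line and flags the response for review if anything is not confirmed. It assumes the verifier actually follows the requested one-claim-per-line format, which is not guaranteed; the function name and output keys are hypothetical.

```python
import re

VERDICTS = ("CONFIRMED", "UNCERTAIN", "NOT_VISIBLE")

def summarize_verdicts(verification_text):
    """Count verdict labels per line and flag responses needing review."""
    counts = {verdict: 0 for verdict in VERDICTS}
    for line in verification_text.splitlines():
        for verdict in VERDICTS:
            # Whole-word match; only the first label on a line is counted.
            if re.search(rf"\b{verdict}\b", line):
                counts[verdict] += 1
                break
    # Any unconfirmed claim sends the response to review.
    counts["needs_review"] = (counts["UNCERTAIN"] + counts["NOT_VISIBLE"]) > 0
    return counts
```

Treating UNCERTAIN the same as NOT_VISIBLE is a conservative choice; a lower-stakes application might only escalate NOT_VISIBLE claims.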

Follow-up Questions

How do you distinguish perception errors from reasoning errors?

Ask the model to describe what it sees before answering the question. If the description is wrong (e.g., wrong objects, wrong text), it is a perception error. If the description is correct but the conclusion is wrong, it is a reasoning error. This distinction matters because the fixes are different: perception errors require better encoders or resolution; reasoning errors require better prompting or model capability.
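
The describe-then-answer diagnosis can be recorded per eval example and tallied. This is a small sketch that assumes a reviewer (human or automated) has already marked whether the description and the final answer were correct; the record keys and labels are hypothetical.

```python
from collections import Counter

def classify_error(description_correct, answer_correct):
    """Label one eval example using the describe-then-answer diagnosis."""
    if answer_correct:
        return "correct"
    if not description_correct:
        # The model misperceived the image before it reasoned at all.
        return "perception_error"
    # The model saw the right things but drew the wrong conclusion.
    return "reasoning_error"

def tally_errors(records):
    """Count labels over records with two boolean fields."""
    return Counter(
        classify_error(r["description_correct"], r["answer_correct"])
        for r in records
    )
```

A tally dominated by perception errors points toward resolution or encoder changes, while one dominated by reasoning errors points toward prompting or model capability.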

Does increasing image resolution always help?

Higher resolution improves text reading and fine detail detection but increases token count and cost. It also does not help with fundamental spatial reasoning weaknesses. The optimal strategy is to use high resolution only for regions that need it (multi-crop approaches) rather than upscaling the entire image.
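
A multi-crop approach starts by splitting the image into overlapping tiles, so that only the detail-heavy tiles are sent at high resolution. The sketch below computes tile boxes only; the tile size, overlap, and function name are assumptions, and deciding which tiles deserve a high-resolution pass is left to the caller.

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Return (left, top, right, bottom) boxes that cover the image.

    Neighboring tiles overlap by `overlap` pixels so that text sitting
    on a tile border appears whole in at least one tile.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((left, top, right, bottom))
            if right == width:
                break
            left += step
        if bottom == height:
            break
        top += step
    return boxes
```

Each box can be passed to an image library's crop function. Tiles at the right and bottom edges may be narrower than the rest.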

Are some image types more prone to hallucination?

Yes. Dense charts, medical images, satellite imagery, and technical diagrams produce significantly more hallucinations than everyday photos. These images require precise interpretation that models were less exposed to during training. For these domains, always pair multimodal models with domain-specific verification.

Audio & Video Modalities

Audio and video introduce time, so the system must model sequences of frames or acoustic features as well as their relationship to language. That increases compute cost and raises additional alignment questions, such as what moment in the video supports the answer.

💡 Static images are snapshots. Audio and video are movies — the model must now decide not just what to look at, but when to look.

🖼

Image

Single frame, no temporal dimension. One-shot encoding.

Complexity: Low

🎵

Audio

Temporal sequences of acoustic features. Speech, music, environmental sounds.

Complexity: Medium

🎬

Video

Spatial + temporal. Frame sampling, segmentation, synchronization with audio.

Complexity: High

The Time Dimension

Temporal modalities require capabilities that static image processing does not:

Sampling: A 60-second video at 30fps has 1,800 frames. You cannot encode all of them — you must sample strategically (uniform sampling, keyframe detection, or scene-change detection).
Segmentation: Breaking audio or video into meaningful segments (speech turns, scenes, actions) before encoding.
Synchronization: Aligning audio track with video frames so the model can associate speech with visual events.
Hierarchical reasoning: Understanding events at multiple time scales (a gesture within a sentence within a scene within a conversation).
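
As one illustration of the sampling step, scene-change detection can be approximated by keeping a frame only when it differs enough from the last kept frame. The sketch assumes each frame has already been reduced to a small feature vector (for example a normalized color histogram); the feature choice, threshold, and function name are hypothetical.

```python
def scene_change_indices(frame_features, threshold=0.5):
    """Keep frame 0, then every frame that differs from the last kept one.

    `frame_features` is a list of equal-length numeric vectors, one per
    frame. The difference is the mean absolute difference per dimension.
    """
    if not frame_features:
        return []
    kept = [0]
    for i in range(1, len(frame_features)):
        previous = frame_features[kept[-1]]
        current = frame_features[i]
        diff = sum(abs(a - b) for a, b in zip(previous, current)) / len(current)
        if diff > threshold:
            kept.append(i)
    return kept

# The frame count from the text: 60 seconds at 30 fps.
total_frames = 60 * 30  # 1800
```

Compared with uniform sampling, this spends the frame budget on moments where the content changes, at the cost of one extra pass over the video to compute features.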

Audio Processing Patterns

Audio is typically processed through a speech/audio encoder (like Whisper) that converts acoustic features into embedding sequences. These embeddings are then projected into the LLM's space, similar to how vision encoders work for images. The key difference is that audio sequences can be very long — a 10-minute audio clip produces far more tokens than a single image.
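
A rough calculation makes the length difference concrete. The sketch assumes about 50 encoder embeddings per second, which matches my understanding of Whisper's encoder (1,500 positions per 30-second window), and a hypothetical 576 tokens for one image. Real systems often downsample audio embeddings before the LLM, so treat both numbers as assumptions.

```python
def audio_embedding_count(duration_seconds, embeddings_per_second=50):
    """Estimate how many encoder embeddings an audio clip produces."""
    return int(duration_seconds * embeddings_per_second)

IMAGE_TOKENS = 576  # hypothetical per-image token count

ten_minutes = audio_embedding_count(10 * 60)
print(ten_minutes)                  # 30000 under the assumed rate
print(ten_minutes // IMAGE_TOKENS)  # about how many images that equals
```

This is why audio pipelines usually pool or compress embeddings, or fall back to transcripts, for long recordings.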

Video Processing Challenges

Challenge	Why It Is Hard	Common Approach
Frame count	Too many frames to encode at full rate	Sample 8–32 frames uniformly or at scene changes
Temporal grounding	Must link language answer to specific moment	Timestamp prediction or segment-level attention
Compute cost	Encoding 32 frames costs 32x a single image	Shared encoder with temporal pooling
Long-form reasoning	Events spanning minutes require memory	Hierarchical summarization of frame embeddings
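
The "temporal pooling" approach in the table can be illustrated with simple mean pooling over consecutive frames. This is a minimal sketch on plain Python lists: the window size and function name are assumptions, and production systems may use learned pooling instead.

```python
def temporal_mean_pool(frame_embeddings, window):
    """Average every `window` consecutive frame embeddings into one.

    `frame_embeddings` is a list of equal-length vectors in frame order.
    A final group shorter than `window` is averaged over its own length.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pooled = []
    for start in range(0, len(frame_embeddings), window):
        group = frame_embeddings[start:start + window]
        dim = len(group[0])
        pooled.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return pooled
```

With a window of 4, 32 frame embeddings shrink to 8, so the language model sees a quarter of the sequence length while the shared encoder still runs once per frame.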

See Topic 1: What Is a Multimodal LLM? for the general architecture pattern that extends to these temporal modalities.

→ Temporal modalities require sampling, segmentation, synchronization, and hierarchical reasoning — making them fundamentally harder than static image understanding.

Python Example — Video Frame Sampling for Multimodal Input

import cv2
import base64

def sample_video_frames(video_path, num_frames=8):
    """Sample frames uniformly from a video for multimodal input."""
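    cap = cv2.VideoCapture(video_path)
    # Fail early on a missing or unreadable file instead of returning no frames
    if not cap.isOpened():
        raise ValueError(f"Could not open video: {video_path}")
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))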

    # Calculate uniform sample indices
    indices = [int(i * total_frames / num_frames)
               for i in range(num_frames)]

    frames_b64 = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # Encode frame as JPEG, then base64
            _, buffer = cv2.imencode(".jpg", frame)
            b64 = base64.b64encode(buffer).decode("utf-8")
            frames_b64.append(b64)
    cap.release()

    # Build multimodal message with sampled frames
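    # Report the number of frames actually decoded, which can be lower
    # than num_frames if some reads failed
    content = [{"type": "text",
                "text": f"These are {len(frames_b64)} frames sampled "
                        "uniformly from a video. Describe what happens."}]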
    for b64 in frames_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}
        })
    return content

Follow-up Questions

How does Gemini handle video natively compared to frame sampling?

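Gemini accepts video files directly, so you do not have to extract frames yourself: frames pass through its vision encoder and are interleaved with audio tokens in the model's context. This is still sampling rather than full-rate encoding. The public API documents a default of about one frame per second, so very fast motion can be missed, and long videos consume a large share of the context window. Compared with sending 8–32 hand-picked frames, native input loses less information but requires significantly more compute and context space.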

Can multimodal LLMs process real-time audio/video streams?

Some models (GPT-4o, Gemini Live) support real-time streaming for audio, processing speech incrementally. Real-time video is more challenging due to bandwidth and compute requirements. Current real-time systems typically operate on audio with periodic video frame sampling rather than continuous video encoding.

What is Whisper's role in multimodal audio systems?

Whisper is OpenAI's speech recognition model that converts audio into text. It has an encoder-decoder design. In some multimodal systems, only Whisper's encoder is reused as the audio encoder, producing embeddings that are projected into the LLM's space. In simpler pipelines, the full Whisper model transcribes audio to text first, and the LLM reasons over the transcript. The encoder approach preserves acoustic information (tone, emphasis, speaker identity) that a transcript discards.

High-Value Use Cases

The best early multimodal use cases are those where grounding is clear and the workflow has measurable value: document understanding, screenshot support, visual quality inspection, chart explanation, medical imaging assistance with human oversight, and accessibility-oriented image description.

💡 The business value of multimodality comes from grounded perception in places where text alone is insufficient — not from adding images for novelty.

High ROI

Document Understanding

Invoices, contracts, forms — structured extraction with layout awareness saves hours of manual work.

High ROI

Screenshot Support

Users share screenshots of errors, UI states, dashboards. The model can diagnose from the visual context.

Medium ROI

Quality Inspection

Manufacturing defect detection, product photo validation, compliance checking on visual materials.

Medium ROI

Chart & Data Explanation

Auto-generate plain-language summaries of business dashboards and reports for non-technical stakeholders.

Specialized

Medical Imaging Assist

Pre-screening, report drafting, teaching aids. Always requires human oversight for clinical decisions.

Specialized

Accessibility

Alt text generation, image descriptions for screen readers, visual content narration for visually impaired users.

Value Comes from Grounded Perception

A strong interview answer focuses on use cases where multimodality adds nontrivial evidence — not on adding images for novelty. The business value comes from grounded perception in places where text alone is insufficient. If a text-only model could do the job equally well, the multimodal overhead (cost, latency, complexity) is not justified.

Evaluating Use Case Fit

Criterion	Good Fit	Poor Fit
Grounding clarity	Answer is verifiable from the image	Answer requires world knowledge more than image
Measurable value	Saves hours of manual work per day	"Nice to have" feature with no metric
Error tolerance	Errors are caught by downstream process	Single error causes major harm
Data availability	Representative image dataset for eval exists	No way to measure visual accuracy

Implementation Priorities

When deploying multimodal use cases, prioritize in this order:

Build evaluation first: Create a task-specific eval set with ground-truth answers before building the pipeline. See Topic 7: Evaluating Multimodal Systems.
Start with high-grounding tasks: Document extraction and screenshot analysis are reliable because the expected output is verifiable.
Add human oversight for high-stakes tasks: Medical imaging and legal document review must include human verification.
Measure continuously: Track grounding quality over time as models update and user patterns shift.

→ Deploy multimodal capabilities where the image provides irreplaceable evidence — the best use cases pair clear grounding with measurable workflow improvement.

Python Example — Document Understanding Pipeline

from openai import OpenAI
import json

client = OpenAI()

def extract_invoice_data(image_url):
    """Extract structured data from an invoice image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": """Extract the following from this invoice:
- vendor_name: string
- invoice_number: string
- date: YYYY-MM-DD
- line_items: [{description, quantity, unit_price, total}]
- subtotal: number
- tax: number
- total: number

If any field is unreadable, set it to null.
Return valid JSON only."""},
                {"type": "image_url",
                 "image_url": {"url": image_url}}
            ]
        }],
        response_format={"type": "json_object"}
    )

    # Parse and validate the structured output
    data = json.loads(response.choices[0].message.content)
    # Flag any null fields for human review
    nulls = [k for k, v in data.items() if v is None]
    if nulls:
        data["_needs_review"] = nulls
    return data
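
Null checks catch unreadable fields but not misread digits. One cheap cross-check is the invoice's own arithmetic: line items should multiply out, sum to the subtotal, and add up with tax to the total. The sketch below assumes the extracted values are already numbers and that the invoice has no discounts or shipping lines; the function name and tolerance are hypothetical.

```python
def validate_invoice_totals(data, tolerance=0.01):
    """Return a list of arithmetic inconsistencies in extracted invoice data."""
    issues = []
    items = data.get("line_items") or []

    # Each line: quantity * unit_price should equal the line total.
    for i, item in enumerate(items):
        qty = item.get("quantity")
        price = item.get("unit_price")
        line_total = item.get("total")
        if None not in (qty, price, line_total):
            if abs(qty * price - line_total) > tolerance:
                issues.append(f"line_items[{i}]: quantity * unit_price != total")

    subtotal, tax, total = data.get("subtotal"), data.get("tax"), data.get("total")

    # Line totals should sum to the subtotal.
    item_totals = [item.get("total") for item in items]
    if subtotal is not None and items and None not in item_totals:
        if abs(sum(item_totals) - subtotal) > tolerance:
            issues.append("sum of line item totals != subtotal")

    # Subtotal plus tax should equal the grand total.
    if None not in (subtotal, tax, total):
        if abs(subtotal + tax - total) > tolerance:
            issues.append("subtotal + tax != total")
    return issues
```

Any non-empty result can be routed to the same human review queue as the null fields, since a failed sum usually means at least one digit was misread.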

Follow-up Questions

How do you justify the cost of multimodal models for document processing?

Calculate the cost per document with multimodal processing versus manual review. If a human takes 5 minutes per document at $30/hour, that is $2.50 per document. A multimodal API call might cost $0.05–$0.20 per document, which is 12–50x cheaper on API cost alone. Once you add a human review step for flagged documents, the net saving shrinks, and the ROI is typically closer to 5–20x.
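
The arithmetic behind this estimate can be written out directly. The labor and API figures come from the answer above; the review rates (5% and 10% of documents flagged for a full manual pass) are illustrative assumptions, as are the function names.

```python
def manual_cost_per_document(minutes, hourly_rate):
    """Labor cost of handling one document entirely by hand."""
    return minutes * hourly_rate / 60

def blended_cost_per_document(api_cost, review_rate, manual_cost):
    """API cost plus the share of documents that still need manual review."""
    return api_cost + review_rate * manual_cost

manual = manual_cost_per_document(minutes=5, hourly_rate=30)  # 2.50

# Pessimistic case: expensive call, 10% of documents flagged (assumed rate).
high = blended_cost_per_document(0.20, 0.10, manual)
# Optimistic case: cheap call, 5% of documents flagged (assumed rate).
low = blended_cost_per_document(0.05, 0.05, manual)

print(round(manual / high, 1), round(manual / low, 1))
```

Under these assumptions the ratio lands between roughly 5x and 15x, which is consistent with the range quoted above and shows that the review rate, not the API price, dominates the blended cost.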

What role does accessibility play as a multimodal use case?

Accessibility is one of the most impactful multimodal applications. Generating alt text for images, describing visual content for screen readers, and narrating video content directly improves access for visually impaired users. This use case also has relatively high error tolerance — an imperfect description is better than none — making it ideal for early deployment.

Should multimodal features be mandatory or opt-in for users?

Start with opt-in. Multimodal processing adds latency and cost, and not every workflow benefits. Let users attach images when they find it helpful. Monitor adoption rates and success metrics to identify where multimodal becomes the default. Mandatory multimodal input (e.g., requiring a photo upload) should only be enforced when the image is truly essential to the task.