Yinghong Lan

Long-Form Video Understanding - Part 2: Evaluation and Benchmarks

Yinghong Lan — Wed, 17 Jun 2026 06:09:29 GMT

In my previous writeup, I demystified Multimodal LLM-as-a-Judge; in Part 1 on long-form video understanding, I discussed current bottlenecks and design choices. This post finishes the small trilogy, diving into evaluation and benchmarks, and tying the other two together.

Before we start, let me define benchmarks and evaluation:

A benchmark is typically two things: what you are measuring - a set of tasks; and how you measure - a way to score answers against those tasks. It’s just one component of evaluation.
Evaluation is the harder question of whether that score is actually useful to you. A benchmark can mislead on either side:
- Tasks may not require the capability you care about - e.g. answerable from one frame or the subtitles alone, when your use case needs real long-form understanding.
- Scoring - e.g. LLM judges for open-ended answers - may not be calibrated to your use case.

The rest of this writeup dissects both:

Benchmark survey: I categorize benchmarks by task and scoring complexity, and point out clear gaps and the reasons behind them.
Evaluation and Agentic Judge: there are clear gaps in open-ended, multi-hop questions - which is where a judge becomes essential. I connect back to my previous writeup.
Judge for training: I close with the training-time uses of a judge - for example, the agentic methods from Part 1.

Benchmarks for long-form video understanding

Two different kinds of property matter here:

Task axes - how hard the answer is to produce: where the evidence sits (scope), and how many reasoning steps it takes (depth).
Scoring axes - how hard the answer is to verify: whether it’s closed or open-ended, and whether it’s objective or subjective.

And, as we’ll see, the two can be quite correlated.

Task axes: evidence scope and compositional depth

The earlier survey by Tang et al. slices video tasks by granularity - abstract (e.g., video classification) vs. temporal (e.g., moment retrieval) vs. spatiotemporal (e.g., object tracking), along with varying degrees of language involvement. At a high level, this captures where the relevant evidence sits (scope). It’s a useful axis, but on its own it can mislead: where the evidence sits doesn’t tell you how hard the task is.

I think there’s a second, equally important dimension: how much planning and decomposition a task takes (depth) - i.e., how many steps the model has to chain together. Take two questions that both need the whole video (same scope), and both expect closed, objective answers:

“Does anyone in a red coat ever appear?” - a presence check: find one match and stop - shallow.
“How many distinct people appear, counting each only once?” - scan the whole video, re-identify, dedup, and tally. Several dependent steps - deep.

So I propose two task axes:

Evidence scope - where the relevant evidence sits: single frame → single clip → dispersed segments → whole video.
Compositional depth - how many steps the model must chain:
- L0 - Perceive: the relevant moment is specified; just recognize what’s there.
- L1 - Locate / retrieve: the target is specified; find whether and where it occurs.
- L2 - Single-hop reason: find the relevant evidence and reason once to an answer.
- L3 - Multi-hop / decompose: break into interdependent sub-questions, chain retrieve-then-reason steps, and aggregate.
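To make the two axes concrete, here is a minimal sketch of how one might tag a question along scope and depth. The class and field names are hypothetical, and the labels assigned to the two example questions reflect my own reading of them rather than any benchmark's official annotation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """Where the relevant evidence sits."""
    SINGLE_FRAME = 0
    SINGLE_CLIP = 1
    DISPERSED_SEGMENTS = 2
    WHOLE_VIDEO = 3


class Depth(IntEnum):
    """How many steps the model must chain (L0 to L3)."""
    PERCEIVE = 0
    LOCATE = 1
    SINGLE_HOP = 2
    MULTI_HOP = 3


@dataclass(frozen=True)
class TaskProfile:
    question: str
    scope: Scope
    depth: Depth


# Assumed labels for the two example questions above.
red_coat = TaskProfile("Does anyone in a red coat ever appear?",
                       Scope.WHOLE_VIDEO, Depth.LOCATE)
head_count = TaskProfile("How many distinct people appear, counting each only once?",
                         Scope.WHOLE_VIDEO, Depth.MULTI_HOP)

same_scope = red_coat.scope == head_count.scope   # both need the whole video
deeper = head_count.depth > red_coat.depth        # but they differ in depth
```

Using ordered enums keeps "same scope, different depth" a one-line comparison, which is the distinction the two questions are meant to show.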

Scoring axes: closed/open-ended and objective/subjective

To show that scoring is a separate dimension, take two questions of similar task difficulty:

“How many distinct people appear in the whole video?” - wide scope and multi-hop, but the answer is a number you can check automatically.
“Who are the distinct people, and what distinguishes each?” - similar underlying work, but now the answer is free-form text with no exact match; you need a mechanism to read and judge it.

So I propose two scoring axes:

Closed vs. open - can you auto-score it (exact match, tIoU), or do you need a model to read the answer (a judge)?
Objective vs. subjective - is there a single defensible answer, or legitimate variation? (“Is the cinematography good?” - even a perfect judge can’t guarantee agreement without more objective standards.)
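For the closed side of the first axis, the two auto-scoring checks named above can be sketched in a few lines. The function names and the normalization rule (trim and lowercase) are my own assumptions; individual benchmarks define their own matching rules.

```python
from typing import Tuple


def exact_match(prediction: str, reference: str) -> bool:
    """Closed scoring: compare after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def temporal_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """tIoU of two (start, end) intervals, both in seconds."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


count_is_right = exact_match(" 7 ", "7")
moment_score = temporal_iou((10.0, 20.0), (15.0, 25.0))  # 5s overlap / 15s union
```

Neither function can read a free-form answer such as a description of who the distinct people are, which is the point at which a judge becomes necessary.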

It's worth noting that these axes together trace the line that separates verifiable from non-verifiable rewards in reinforcement learning.

Benchmark categorization

I categorize the major long-video benchmarks along all four axes below: scope and depth are given as a center-of-mass range (most benchmarks span several rungs), while format and ground truth reflect each benchmark's primary scoring mode. (The full list of the 16 benchmarks is in the appendix at the end.)

A few patterns worth highlighting:

Almost every benchmark is closed and objective. Only MLVU and InfiniBench mix in open-ended questions - and those open slices are also the more subjective ones - while LVSum is the one recent benchmark that is squarely open-ended and subjective throughout.
Most benchmarks live at locate-and-retrieve (L1) or single-hop reasoning (L2). Genuine L3 - multi-hop decomposition - shows up only as subsets: the multi-detail split of MLVU, the event-linking and spoiler questions in InfiniBench, the adaptation track of Video-MMMU, route planning in VSI-Bench, forward-active responding in OVO-Bench.
Long is not deep. The longest benchmarks are still mostly L1-L2 - retrieve-and-answer over a long video, not decompose-and-reason: LVBench at ~68 min average, InfiniBench at ~53 min.
The benchmarks that aim deep often don’t test it. EgoSchema, Video-MMMU’s adaptation track, and OVO-Bench’s forward-active responding all target multi-step reasoning, but force it into multiple choice - where a model can pattern-match or eliminate its way to the answer without actual step-by-step decomposition.

Here is the evidence that the task and scoring axes are closely related: the only benchmarks that climb to L3 and whole-video scope are exactly the ones forced open-ended and subjective (LVSum, the open slices of MLVU and InfiniBench) - and there are barely any of them.

That isn’t a coincidence. Current benchmarks live in the closed + objective + low-depth corner, even as video length scales up, and the two halves reinforce each other. The closed + objective bias likely reflects a desire to grade cheaply, with no Multimodal LLM-as-a-Judge required. And that format, in turn, limits the depth you can actually test - a genuinely multi-hop question posed as multiple choice invites exactly the shortcutting above, and a deep, holistic task rarely has one objective answer. So the tasks that genuinely need long-form understanding - deep, whole-video, open-ended - are also the ones we can’t grade with a cheap exact-match check. Which is exactly where a judge comes in.

Evaluation and Agentic Judge

The benchmark gap above - open-ended, multi-hop questions - is exactly where auto-scoring runs out and you need a model to read the answer and judge it. This is where I’ll pick up the thread from my previous writeup on Multimodal LLM-as-a-Judge.

For multi-hop tasks specifically, a judge unlocks something a benchmark score can’t: process evaluation. Instead of only checking the final answer, it can verify the reasoning trajectory - did the model actually find the right evidence at each step, or did it guess its way to a plausible conclusion? In the judge writeup I called this agentic step verification.

That points to a design I find compelling for L3 questions: an agentic judge. A main judge decomposes the evaluation the way the multi-hop task decomposes - spawning smaller, specialized judges for each step and aggregating their verdicts. The decomposition that makes producing the answer hard is the same move that makes grading it tractable; it’s the natural extension of the rubric- and checklist-style judging I discussed before - breaking one fuzzy judgment into many checkable pieces.

But here’s the part I find most interesting, and it’s not just interpretability: decomposing the evaluation redistributes where the hard perceptual work lands. Take a genuinely L3 question over a two-hour film - “Why does the protagonist turn on their mentor in the final act?” - and say the model answers, “because the mentor’s betrayal, revealed at 1:12, recontextualizes the debt from act one.” A single judge would have to re-watch the whole film to grade that. An agentic judge instead asks a few different kinds of question:

Localization: do the cited scenes exist and show what’s claimed? - a couple of checks over short, pinpointed spans.
Grounding: does the 1:12 clip actually reveal the betrayal? - again, local.
Entailment: given those verified facts, does the “recontextualizes the debt” link actually follow? - pure text and logic, no video at all.
Completeness: did the answer miss another motive set up somewhere else in the film? - this one still needs to scan the whole thing.

Notice what happened: most of the grading collapsed into text-only logic checks - where a judge is very reliable - and short-span perceptual checks, where it’s far stronger than on long video. For all of those, the judge never has to take in the whole two hours. The long-video burden didn’t disappear, though - it got concentrated into that last check, which is where the real difficulty now lives.

And that difficulty is recall, not grounding. The localization and grounding checks only validate the evidence the model chose to cite - a model that points at the wrong scene can still pass every local check while having missed the real evidence. So the irreducible core is finding the evidence the answer should have used but didn’t, which means searching the whole video again. This is exactly where “agentic” differs from “checklist”: a checklist is static, but an agentic judge can retrieve spans on demand and actively hunt for that missed evidence - the same decompose-retrieve-aggregate loop the Part 1 agents use to answer, turned around to verify.
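The four checks above can be sketched as a small aggregation loop. This is only an illustration under heavy assumptions: the sub-judges here are stubs that read precomputed flags from a hypothetical answer record, where a real system would call a model on a short clip (localization, grounding), on text only (entailment), or on retrieved spans across the film (completeness).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    check: str
    passed: bool
    needs_full_video: bool


def agentic_judge(answer: dict,
                  sub_judges: Dict[str, Callable[[dict], bool]]) -> List[Verdict]:
    """Run every sub-judge and record which checks required the whole video."""
    full_video_checks = {"completeness"}
    return [Verdict(name, judge(answer), name in full_video_checks)
            for name, judge in sub_judges.items()]


# Stub sub-judges (hypothetical): stand-ins for real model calls.
sub_judges = {
    "localization": lambda a: all(0 <= t <= a["video_len_s"] for t in a["cited_times_s"]),
    "grounding": lambda a: a["clip_shows_claim"],
    "entailment": lambda a: a["link_follows"],
    "completeness": lambda a: not a["missed_evidence"],
}

# The film example: one citation at 1:12 (4320 s) in a two-hour video,
# with a motive elsewhere in the film that the answer failed to mention.
answer = {"video_len_s": 7200, "cited_times_s": [4320], "clip_shows_claim": True,
          "link_follows": True, "missed_evidence": True}

verdicts = agentic_judge(answer, sub_judges)
overall = all(v.passed for v in verdicts)
failed = [v.check for v in verdicts if not v.passed]
```

In this toy run every local check passes and only the completeness check fails, which mirrors the recall problem described above: the cited evidence validates, yet the answer is still incomplete.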

Two caveats, even so:

Every sub-judge needs its own calibration, and the errors compound up the tree - you’ve traded one shaky judgment for many.
If the judge runs on the same kind of model as the policy it grades, the two share a perceptual blind spot - the judge will quietly miss exactly the visual mistakes the model is most prone to.
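A rough illustration of the first caveat, under two simplifying assumptions of mine: sub-judge errors are independent, and the overall verdict is right only when every sub-judge is right. The 95% accuracy is an illustrative number, not a measurement.

```python
def all_correct_probability(per_judge_accuracy: float, num_sub_judges: int) -> float:
    """Chance that every sub-judge is right, assuming independent errors."""
    return per_judge_accuracy ** num_sub_judges


one_judge = all_correct_probability(0.95, 1)
four_judges = all_correct_probability(0.95, 4)   # roughly 0.81
```

Four sub-judges that are each right 95% of the time are all right only about 81% of the time. Real sub-judges are unlikely to be independent, so treat this as intuition rather than an estimate.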

So decomposing the evaluation narrows the blind spot to its hard core - whole-video recall, and the perception the judge shares with the model - but it doesn’t remove it.

Judge for training

So far I’ve treated the judge as a measuring instrument. But as I argued in the judge writeup, that’s the smaller half of its value - the more consequential use is at training time, and that’s where evaluation loops directly back into the design choices from Part 1.

Recall that Part 1 ended with trained agents: a model that decomposes a query, calls tools, and is optimized with RL or preference optimization. RL needs a reward, and for anything open-ended that reward is increasingly a judge.

This is where the agentic judge does something a final-answer judge can’t. A scalar “this answer is a 6” tells the model it was wrong but not where; per-step verdicts - localization fine, grounding fine, the step-2-to-3 link unsupported - tell it which step to fix. That’s the credit-assignment problem for multi-hop RL, and it’s exactly the kind of dense process reward the agentic methods in Part 1 are reaching for.
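A minimal sketch of the difference, using the three per-step verdicts from the example above. The step names, the 0/1 reward values, and the mean used as the scalar score are all assumptions for illustration, not a recipe from any of the cited papers.

```python
from typing import List, Optional

STEPS = ["localization", "grounding", "step-2-to-3 link"]


def outcome_reward(verdicts: List[bool]) -> float:
    """One scalar for the whole answer: says how wrong, not where."""
    return sum(verdicts) / len(verdicts)


def process_rewards(verdicts: List[bool]) -> List[float]:
    """One reward per step, so credit lands on the step that earned it."""
    return [1.0 if ok else 0.0 for ok in verdicts]


def first_failed_step(verdicts: List[bool]) -> Optional[str]:
    """Name of the earliest failing step, or None if every step passed."""
    for name, ok in zip(STEPS, verdicts):
        if not ok:
            return name
    return None


verdicts = [True, True, False]
scalar = outcome_reward(verdicts)
dense = process_rewards(verdicts)
to_fix = first_failed_step(verdicts)
```

The scalar collapses three verdicts into a single number, while the dense version keeps the position of the failure, which is the information a multi-hop policy needs for credit assignment.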

But the second caveat above turns into a real failure mode here. A process reward is only as honest as the judge behind it - and if that judge shares the policy’s perceptual blind spot, you’re rewarding the model for answers that look right to a grader that can’t see what’s wrong. The key unlock would be to de-correlate the judge’s perception gaps from the generator’s.

Closing of the trilogy

For current long-video benchmarks, length is no longer the bottleneck - we can already build two-hour benchmarks. The bottleneck is that the tasks we care about most - deep, whole-video, open-ended - and the only tool that could measure them at scale - a judge - are stuck in the same blind spot described above.

That’s the connection I wanted to highlight across this trilogy:

Part 1 built increasingly sophisticated machinery for deep, long-context, open-ended problems.
The benchmarks surveyed in this post barely cover those.
And the judge that could fill the gap doesn't exist yet - the hard part is keeping it from being weakest in that very spot.

So perhaps the real frontier in long-form video understanding isn’t a longer benchmark or a bigger model - it’s an evaluation we can trust on the tasks we can’t yet measure. A well-calibrated, possibly multi-hop, genuinely multimodal judge is the most promising path I can see - but it’s far from solved.

Appendix: benchmark links

General long-video QA

Video-MME: Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (Fu et al. 2024)
LongVideoBench: LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding (Wu et al. 2024)
MLVU: MLVU: Benchmarking Multi-task Long Video Understanding (Zhou et al. 2025)
LVBench: LVBench: An Extreme Long Video Understanding Benchmark (Wang et al. 2025)
EgoSchema: EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding (Mangalam et al. 2023)
MVBench: MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (Li et al. 2024)

Movies / narrative

InfiniBench: InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows (Ataallah et al. 2025)

Knowledge-intensive

Video-MMMU: Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos (Hu et al. 2025)
MMVU: MMVU: Measuring Expert-level Multidiscipline Video Understanding (Zhao et al. 2025)
VideoKR-Eval: VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (Fu et al. 2026)

Spatial

VSI-Bench: Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces (Yang et al. 2025)

Needle-in-a-haystack

VideoNIAH / VNBench: Needle in a Video Haystack: a Scalable Synthetic Evaluator For Video MLLMs (Zhao et al. 2024)

Instance-level grounding

V2P-Bench: V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-model Interaction (Zhao et al. 2025)

Timestamp-aware summarization

LVSum: LVSum: A Benchmark for Timestamp-Aware Long Video Summarization (Patel et al. 2026)

Streaming

StreamingBench: StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding (Lin et al. 2024)
OVO-Bench: OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding (Li et al. 2025)

Long-Form Video Understanding - Part 1: Bottlenecks and Design Choices

Yinghong Lan — Mon, 15 Jun 2026 02:10:20 GMT

Bottlenecks: memory, compute, evaluation

Recently I have been hearing and reading seemingly contradictory opinions on understanding long-form video (from tens of minutes to several hours), such as:

“Sweeping through the whole video is necessary - we should focus on making that as efficient as possible” vs. “there are many clever tricks to selectively retrieve - let’s explore those.”
“We should just keep improving MLLMs until they can handle everything” vs. “agents are the future of video understanding - let’s build more agent swarms.”

My thesis is that these views are not really disagreeing about what is true - they are making different tradeoffs about where to spend limited “budget”. Unlike a text document or dozens of images, a two-hour video breaks the memory and compute budget in the absence of intentional compressing, sampling, or retrieving. The contradictions above reflect different design choices to solve this core challenge, and there is no consensus yet on a universally optimal design. Quite the contrary - two distinct axes of tradeoff are being actively explored:

The memory axis. When you cannot afford to attend to everything, do you throw information away and lean on adaptive retrieval - or do you keep all the information and compress the attention/KV-cache? Two different answers to the same memory ceiling.
The compute axis. When you cannot yet compute the answer accurately in one pass, do you buy accuracy with an agentic system that runs many inferences - or do you internalize that agentic behavior into the model itself, so it's a learned, native capability rather than an external orchestration loop?
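To put a number on the memory ceiling, here is a back-of-the-envelope calculation. Every constant is an assumption I picked for illustration (1 frame per second, 256 visual tokens per frame, a 128k-token context window); real models differ on all three.

```python
def visual_tokens(duration_s: int, fps: float, tokens_per_frame: int) -> int:
    """Total visual tokens when every sampled frame is kept."""
    return int(duration_s * fps) * tokens_per_frame


def max_frames(context_tokens: int, tokens_per_frame: int) -> int:
    """How many frames fit if the context held nothing but frames."""
    return context_tokens // tokens_per_frame


two_hours_s = 2 * 60 * 60
total_tokens = visual_tokens(two_hours_s, fps=1.0, tokens_per_frame=256)
frames_that_fit = max_frames(128_000, tokens_per_frame=256)
seconds_between_frames = two_hours_s / frames_that_fit
```

Under these assumptions a two-hour video costs about 1.84M visual tokens, while the context fits only 500 frames, or one frame every 14.4 seconds. Something has to give, and the memory axis is about what.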

And there is a third bottleneck, which is, as always, evaluation. The problem is that some benchmarks do not control for the complexity and dependency of the tasks they bundle together, which makes “approach A beats approach B” claims much weaker than they look:

Complexity isn’t controlled. Even the benchmarks marketed as “long” rarely exceed an hour, while real production workloads often run for hours.
Dependency isn’t controlled. Plenty of “video” questions are anchored on a single frame or a few seconds, or are answerable from the transcript/subtitle alone, with no real long-range understanding required.

To keep this writeup focused and readable, I’ll use the rest of it to survey the design choices along these two axes, and leave evaluation and benchmarks for long-form video understanding to a separate future writeup (Part 2).

Sidebar clarification: technically one could push a two-hour video into a long-context model like Gemini. But it does not work reliably for tasks that genuinely require long-form temporal understanding, e.g., “what is the story arc for the character who first appears in a blue coat between 20:02 and 20:22?” While it’s hard to know the full details of these closed models, it is reasonable to conjecture that they lean on subtitles/ASR, metadata, or a handful of frames for most questions. As shown in recent benchmarks built from movies and TV shows (InfiniBench, Ataallah et al. 2025), models can score quite well on certain tasks based purely on subtitles and metadata (related to the aforementioned issue of “dependency isn’t controlled”). In summary, “feed the whole video to a frontier model” is not a silver bullet for long-form video understanding.

Memory design choice: discard vs. keep

Of course, solutions are not as simple as keep nothing vs. keep everything - there is a full spectrum, running from aggressively throwing information away to keeping all of it and paying the cost somewhere else.

Aggressively discard: adaptive selection of frames, clips, and patches

Instead of uniform sampling, previous work has proposed selecting only what matters:

Lightweight learned selectors of frames (M-LLM Based Video Frame Selection, Hu et al. 2025)
Training-free key clip selection that keeps short coherent segments instead of isolated frames (From Frames to Clips, Sun et al. 2025)
RL for samplers (Temporal Sampling Policy Optimization, Tang et al. 2025) where an event-aware “temporal agent” is trained for keyframe selection
Reasoning-driven sampling that traverses coarse summaries, refines its focus, and halts once it has enough evidence (LongVideo-R1, Qiu et al. 2026)
Joint RL training of the sampler and the model (MSJoE, Tan et al. 2026)

There is a second, somewhat orthogonal design choice here: most selectors are “query-conditioned”, meaning they select frames or clips based on the user’s question, while others are “query-agnostic”. For example, Attend Before Attention / AutoGaze (Shi et al. 2026) does patch-level pre-encoder selection, trained to keep the minimal set of patches that still reconstructs each frame within an error budget. The right choice depends on the application: “query-conditioned” tends to win on accuracy, but “query-agnostic” is the only option when there is no query up front (e.g. building an index).
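As a point of reference, the simplest query-conditioned selector is top-k by embedding similarity. The sketch below is a generic baseline, not the method of any paper listed above; the embeddings are toy 2-d vectors, where a real system would take them from a vision-language encoder.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_embs: List[Sequence[float]],
                  query_emb: Sequence[float], k: int) -> List[int]:
    """Keep the k frames most similar to the query, returned in temporal order."""
    ranked = sorted(range(len(frame_embs)),
                    key=lambda i: cosine(frame_embs[i], query_emb),
                    reverse=True)
    return sorted(ranked[:k])


frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
query = [1.0, 0.0]
kept = select_frames(frames, query, k=2)
```

A query-agnostic selector would have to score frames without `query_emb`, for example by novelty relative to neighboring frames, which is why it remains usable when building an index ahead of time.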

Shrink instead of drop: more efficient representations

Instead of hard-dropping frames, we can shrink the representation itself - and there's more than one way to do that:

Compression: Hierarchical Differential Distillation / ViLAMP (Cheng et al. 2025) proposed a “mixed precision” style approach to keep keyframes intact and compress the rest at the patch level
Pooling: LVC (Wang et al. 2025) studied retrofitting long-form video understanding capabilities onto existing VLMs by query-weighted pooling, collapsing windows of densely-sampled frames into a handful of “pseudo-frames”
Cheaper encoding: LiteFrame (Kim et al. 2026) points out that, with aggressive “post-hoc” visual token reduction (after feature extraction), the bottleneck moves to the per-frame vision encoder. So it distills a more efficient vision encoder for better latency-accuracy tradeoffs. (Note: this line of work can help regardless of keep or discard since it sits upstream)
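To illustrate the pooling idea in its most basic form, the sketch below collapses each window of frame features into one query-weighted average. It is a generic softmax-weighted pooling under my own assumptions (dot-product relevance, fixed window size), not LVC's actual implementation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_window(frames: List[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Collapse one window into a single pseudo-frame, weighted by query relevance."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(len(query))]


def to_pseudo_frames(frames: List[Sequence[float]], query: Sequence[float],
                     window: int) -> List[List[float]]:
    """Split the frame sequence into fixed windows and pool each one."""
    return [pool_window(frames[i:i + window], query)
            for i in range(0, len(frames), window)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
pooled = to_pseudo_frames(frames, query=[4.0, 0.0], window=2)
```

Four frames become two pseudo-frames, each dominated by the frame that matches the query. Nothing is hard-dropped, but the less relevant frame survives only as a small contribution to the average.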

Keep-it-all: cheaper attention

Another school of thought holds that we shouldn’t drop anything, because whatever you discard early may very well be what the question turns out to need later. InternVideo3 (Yan et al. 2026) explicitly rejects “aggressive frame subsampling, retrieval, or summarization” and instead proposes an attention re-parameterization (Multimodal Multi-head Latent Attention, M2LA) that compresses the KV-cache while preserving the full multimodal token stream. It's the clearest video-native instance of the keep-everything bet, potentially inspired by the broader efficient-long-context-attention line (latent / multi-head latent attention).

An orthogonal lever: faster decoding

Regardless of keep or discard, speculative decoding can accelerate a model’s token-by-token generation without changing its outputs. ParallelVLM (Kong et al. 2026) - following the SpecVLM (Ji et al. 2025) work - parallelizes the draft-then-verify pipeline to cut decoding latency.
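For readers new to the idea, here is a toy version of draft-then-verify under greedy decoding, showing why the output is unchanged. It is a generic sketch, not the pipeline of ParallelVLM or SpecVLM: the two "models" are hypothetical arithmetic functions, and the verification step that a real system runs as one parallel forward pass is written here as a sequential loop.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # greedy next-token function over a prefix


def greedy_decode(model: Model, prompt: List[int], num_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       num_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks them in order: keep matches, and at the first
        #    mismatch emit the target's own token and discard the rest.
        for token in proposal:
            verified = target(out)
            out.append(verified)
            if verified != token or len(out) >= goal:
                break
    return out[len(prompt):]


def target(prefix: List[int]) -> int:
    return (sum(prefix) * 7 + 3) % 10


def draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 3.
    guess = target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 10


plain = greedy_decode(target, [1, 2], 12)
speculative = speculative_decode(target, draft, [1, 2], 12)
```

Every emitted token is the target's own choice, so the two sequences are identical. The speedup in practice comes from verifying several drafted tokens in a single target pass, which this toy loop does not model.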

Streaming videos: a routing problem

Streaming is where keep-it-all is not an option since frames keep arriving forever - the memory budget is bounded while the input is not. A body of work (CurveStream, Wang et al. 2026; FluxMem, Xie et al. 2026; VAM, Li et al. 2026; OASIS, Liang et al. 2026) has converged on “hierarchical memory”: maintain a fixed budget and route incoming frames into keep vs. discard vs. not-sure buckets. It is worth noting some failure modes here: memory built from earlier model-generated narrations can compound errors, and naive retrieval into the context window can contaminate reasoning. To put it simply, bad memory is worse than less memory.
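A minimal sketch of the routing idea, assuming each incoming frame already carries a relevance or novelty score in [0, 1]. The thresholds, the eviction order (oldest not-sure frame first, then the lowest-scoring kept frame), and the class itself are hypothetical choices of mine, not the design of any system cited above.

```python
from collections import deque
from typing import Deque, List, Tuple


class StreamingMemory:
    def __init__(self, budget: int, keep_at: float = 0.7, drop_below: float = 0.3):
        self.budget = budget
        self.keep_at = keep_at
        self.drop_below = drop_below
        self.kept: List[Tuple[int, float]] = []
        self.unsure: Deque[Tuple[int, float]] = deque()

    def route(self, frame_id: int, score: float) -> str:
        """Send a frame to one of three buckets, then enforce the fixed budget."""
        if score < self.drop_below:
            return "discard"
        if score >= self.keep_at:
            self.kept.append((frame_id, score))
            bucket = "keep"
        else:
            self.unsure.append((frame_id, score))
            bucket = "not-sure"
        self._enforce_budget()
        return bucket

    def _enforce_budget(self) -> None:
        while len(self.kept) + len(self.unsure) > self.budget:
            if self.unsure:
                self.unsure.popleft()  # oldest uncertain frame goes first
            else:
                self.kept.remove(min(self.kept, key=lambda item: item[1]))


memory = StreamingMemory(budget=3)
for frame_id, score in enumerate([0.9, 0.5, 0.1, 0.8, 0.95, 0.99]):
    memory.route(frame_id, score)
kept_ids = [frame_id for frame_id, _ in memory.kept]
```

However long the stream runs, the store never holds more than `budget` frames. The quality of the score decides whether what remains is useful, which is the "bad memory is worse than less memory" point.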

Compute design choice: externalizing computation vs. internalizing it

This axis of design choice boils down to: for long-form video understanding, there is a need for step-by-step computation (e.g., decompose the query → retrieve evidence → reason through the evidence → final answer). We have to decide where and how this type of step-by-step computation happens.

Option 1 - externalize into language / symbolic intermediates

This is the classic option: turn the video into text (video caption, audio transcription) or structured symbols (object tracks, bounding boxes) and then invoke a text LLM to reason over these intermediate artifacts. Although the video representation here is not adaptive (artifacts are fixed after the initial step), this option is quite robust in practice, so it is always worth including as a baseline. As mentioned above, subtitle/ASR alone is very competitive for some tasks. In addition, ObjectMLLM (Tang et al. 2025) found that explicit object structure remains necessary, and that the best way to feed it to the model is as plain text rather than as distributed visual embeddings.

Option 2 - externalize into an agent loop

The agent loop is typically: give the model tools - crop the video, retrieve a clip, run a detector - then let it observe, reason, act, and repeat across rounds. The difference across work is how the agent's policy is obtained.
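The loop itself is small; here is a skeleton with a hand-written policy standing in for the model. The tool name, the policy, and the returned strings are all hypothetical placeholders. The work discussed in this section differs mainly in how `policy` is obtained (prompted versus trained).

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Policy = Callable[[str, List[str]], Tuple[str, str]]  # returns (action, argument)


def run_agent(question: str, policy: Policy, tools: Dict[str, Tool],
              max_rounds: int = 5) -> str:
    """Observe, reason, act, repeat, until the policy answers or rounds run out."""
    observations: List[str] = []
    for _ in range(max_rounds):
        action, argument = policy(question, observations)
        if action == "answer":
            return argument
        observations.append(tools[action](argument))
    return "no answer within budget"


def toy_policy(question: str, observations: List[str]) -> Tuple[str, str]:
    # Stand-in for an MLLM call: retrieve once, then answer from the evidence.
    if not observations:
        return ("retrieve_clip", "red coat")
    return ("answer", "yes, see " + observations[-1])


toy_tools: Dict[str, Tool] = {"retrieve_clip": lambda query: "clip@00:12:30"}

result = run_agent("Does anyone in a red coat ever appear?", toy_policy, toy_tools)
```

The `max_rounds` cap is the compute budget made explicit: each extra round is another inference call.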

Prompted / training-free orchestration - the policy is hand-built or zero-shot prompted:

VideoMind (Bhatnagar et al. 2026) - a single MLLM plays multiple roles: decomposes the query into sub-queries and switches between operational modes (multi-scale temporal search vs. single-frame visual detail), allocating compute on the fly.
Deep Video Discovery (Zhang et al. 2025) - first indexes the video into a multi-granular, searchable database (segmented clips → captions/embeddings), then lets an LLM agent autonomously search and retrieve over it with tools, rather than following a fixed retrieval procedure.

Trained policy - a clear recent trend: instead of fixing the scaffold by hand, train the agent's policy. The recipe usually starts with an SFT cold start (imitate expert traces to teach a tool's format and semantics), then a policy-optimization step - and there are two flavors of that step:

Trajectory preference optimization - VideoExplorer (Yuan et al. 2025) intertwines planning, temporal grounding, and re-perception in one loop; after SFT, it applies trajectory-level DPO (TDPO) to reward faithful full trajectories and penalize flawed reasoning paths.
Online RL (GRPO) - LongVT (Yang et al. 2026) trains a native "crop-and-re-inspect" tool for a global-skim-then-local-zoom loop; VideoSeeker (Zhao et al. 2026) trains instance-level view/crop tools driven by visual prompts - among a fast-growing cluster (for example see also LongVideo-R1 mentioned above).

It’s worth noting that the training recipe is finicky here: base models don’t use tools on their own, so SFT is needed to ground the tool. An explicit “use the tool” reward yields little gain once SFT has grounded the tool, and naive recall-based grounding rewards get hacked easily (an IoU-style reward is better). Following SFT, RL mostly helps get better agent policies for the same external loop, e.g. fewer wasted inference calls.
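A small worked example of why a recall-based grounding reward is easy to hack while an IoU-style one is not. The intervals are made up for illustration: a 30-second ground-truth moment inside a two-hour video.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Span, b: Span) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def recall_reward(pred: Span, gold: Span) -> float:
    """Fraction of the gold span covered by the prediction."""
    return overlap(pred, gold) / (gold[1] - gold[0])


def iou_reward(pred: Span, gold: Span) -> float:
    """Intersection over union: also penalizes predicting too much."""
    inter = overlap(pred, gold)
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


gold = (600.0, 630.0)
whole_video = (0.0, 7200.0)   # the degenerate "select everything" prediction
tight = (605.0, 630.0)        # an honest, slightly short prediction

hack_recall, hack_iou = recall_reward(whole_video, gold), iou_reward(whole_video, gold)
tight_recall, tight_iou = recall_reward(tight, gold), iou_reward(tight, gold)
```

Predicting the whole video earns a perfect recall reward but an IoU below 0.01, while the tight prediction scores about 0.83 on both. Under the recall reward, the policy is paid more for not localizing at all.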

Option 3 - internalize the agent loop into the model

Option 3 takes the Option-2 trend one step further. The RL-trained agents above still run an explicit external loop - RL just gives them a better policy for it. Option 3 asks: once the policy lives in the weights, why keep the external loop at all? - and collapses the step-by-step computation into the MLLM's own forward pass, so the model learns the agentic behavior rather than relying on an orchestration wrapped around it. Similar to everything else, there is a spectrum of how far you take this:

Latent reasoning - instead of emitting explicit tool calls or text, just do the intermediate reasoning in continuous hidden states. In theory this can be efficient and fully end-to-end, but in practice supervision, training, and generalization may get very challenging - and that’s perhaps why I couldn’t find a good recent example for long-form video understanding.
Internalized discrete operations - instead of executing operations externally, keep the operations internal and interpretable. ATLAS (Guo et al. 2026) is a good recent example: it represents each visual operation as a single discrete “functional token” - no external tool call, no context-switching, yet still an interpretable trace. However, it is done on image reasoning - not long-form video yet.
Internalized full reasoning loop - InternVideo3 (Yan et al. 2026), mentioned above in the memory section, formulates “Multimodal Contextual Reasoning”: observe → reason → act → update, inside one context rather than an external loop. It’s worth noting that this is made possible by the attention trick (M2LA). The model can still call tools, but the agent loop is now a native property of the model’s context, not an external scaffold.

Frontier labs might be converging on this last approach. For example, Kimi K2.5 (Kimi Team 2026) starts from a native multimodal model and trains its agentic behavior in - its multi-agent “swarm” orchestrator is RL-trained rather than hand-built.

However, one question still remains unsettled: does internalizing the agent loop into the model always beat external agent loops? Or are we just hiding the same computation behind a different interface (now the model itself)? Does a fundamentally different approach exist, yet to be fully explored?

This last approach is also a good example to show that the memory and the compute axes are never fully independent from each other: InternVideo3 pairs “Multimodal Contextual Reasoning” with “Multimodal Multi-head Latent Attention”. The third axis, evaluation, is deeply entangled with everything discussed here too - I will cover that in a future writeup (Part 2).

Introduction to (Multimodal) LLM-as-a-Judge

Yinghong Lan — Sun, 14 Jun 2026 01:27:08 GMT

This writeup is an introduction to (Multimodal) LLM-as-a-Judge - a wide overview rather than a deep technical discussion.

Generation-Verification Asymmetry

Let’s begin by addressing this common question: if we provide the same context to both the generator and judge, why would a (Multimodal) LLM-as-a-Judge add value? Below are some common reasons:

Verification is often easier than generation - a common metaphor here is “more people can critique and appreciate great artwork than create it.” The judge does not need to generate high quality and comprehensive answers - it just needs to recognize quality or gaps in one.
“Providing the same context” is not exactly true. The judge receives the generator’s output - e.g., retrieved frames, reasoning path, and final conclusions - in addition to the original context. Furthermore, compared to the original context, the additional artifact - the generator’s output - tends to be more specific to the actual problem to solve. The judge can compare it against specific guardrails and rubrics, and check consistency and gaps.
You can have multiple judges, one for each specific dimension, thereby breaking down a complex matrix of quality and consistency requirements into more tractable metrics. In contrast, a generator needs to balance all these requirements in its output.
The generator often commits sequentially - token by token, chunk by chunk. The judge, in contrast, can review the final output holistically and catch errors or inconsistencies at a higher level.

Multifaceted Value of LLM-as-a-Judge

Next, let’s demystify a common misconception: that a judge is only useful for evaluation. In practice, LLM-as-a-Judge has many application scenarios across both online / inference and offline / training:

Online / Inference time

Quality Control: the judge can reject outputs that fail predefined quality rubrics, or escalate them to a human-in-the-loop - e.g., rejecting a multimodal agent’s answer if it isn’t grounded in the retrieved frames.
Best-of-N selection: the judge can pick the best from multiple candidates (or reasoning trajectories) the generator outputs - e.g., sampling five reasoning paths through a video and selecting the one with the highest grounding and consistency scores.
Self-refinement loops: the judge critiques the generator’s first-pass output (“reasoning skipped frames 30–45”) and the same generator revises with the judge’s feedback, iterating until the output clears the predefined quality bar.
Input into a downstream editor / post-processor: similar to self-refinement, except the judge’s feedback - e.g., missing visual elements, weak grounding, hallucinated entities - goes to a separate editor / post-processor, which fixes the issues directly rather than regenerating from scratch.
Agentic step verification: beyond judging the final output, the judge can validate each intermediate action - tool call, retrieved frame, reasoning step - before the agent commits to the next one, catching errors mid-trajectory rather than after the full answer is produced.
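The first two inference-time uses combine naturally: pick the best of N candidates, then reject it if even the best falls below a quality bar. In the sketch below the judge is a stub that averages two precomputed scores; the field names, the equal weighting, and the thresholds are assumptions for illustration.

```python
from typing import Callable, List, Optional


def best_of_n(candidates: List[dict], judge: Callable[[dict], float],
              min_score: float) -> Optional[dict]:
    """Return the top-scoring candidate, or None if it fails quality control."""
    best = max(candidates, key=judge)
    return best if judge(best) >= min_score else None


def toy_judge(candidate: dict) -> float:
    # Stand-in for a model call that scores grounding and consistency.
    return 0.5 * candidate["grounding"] + 0.5 * candidate["consistency"]


candidates = [
    {"id": "path-1", "grounding": 0.9, "consistency": 0.6},
    {"id": "path-2", "grounding": 0.7, "consistency": 0.9},
    {"id": "path-3", "grounding": 0.4, "consistency": 0.95},
]

picked = best_of_n(candidates, toy_judge, min_score=0.7)
escalated = best_of_n(candidates, toy_judge, min_score=0.9)  # nothing clears the bar
```

A `None` result is the hook for escalation to a human or for another round of generation.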

Offline / Training time

Training data filter: the judge can help filter existing human or synthetic data - e.g., removing flawed, ungrounded, or unverifiable reasoning trajectories - to curate higher quality training datasets.
Synthetic annotator: the judge can help annotate final outputs, trajectories, or intermediate steps - e.g., labeling (query, agent trajectory, final output) triples - to scale training data for the generator beyond what human annotators can produce.
Reward function for reinforcement learning: the judge can provide scalar rewards or preference pairs (chosen vs. rejected) for various RL methods, scaling beyond what human preference labeling can support.
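
As a rough sketch of the preference-pair use above, the snippet below turns judge scores for several candidates of the same query into (chosen, rejected) pairs. The function name and the min_margin threshold are my own assumptions for illustration; the idea is simply that near-ties are more likely judge noise than a real preference, so they are dropped.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_preference_pairs(
    scored: Dict[str, float],
    min_margin: float = 0.2,
) -> List[Tuple[str, str]]:
    """Turn judge scores for one query into (chosen, rejected) pairs.

    Pairs whose score gap is below `min_margin` (a hypothetical threshold)
    are skipped, because a small gap is weak evidence of a preference.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored.items(), 2):
        if abs(score_a - score_b) < min_margin:
            continue
        pairs.append((a, b) if score_a > score_b else (b, a))
    return pairs


# Hypothetical judge scores for three candidate answers to one query.
pairs = build_preference_pairs({"A": 0.9, "B": 0.5, "C": 0.45})
```

Here the B vs. C pair is discarded because the two scores are nearly tied, while A is kept as the chosen response against both B and C.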

Examples of Multimodal LLM-as-a-Judge

LLM-as-a-Judge can be applied across a diverse set of problems - for this writeup, I want to specifically discuss Multimodal LLM-as-a-Judge for multimodal understanding.

MLLM-as-a-Judge (Chen et al. 2024) - the first comprehensive study of Multimodal LLM-as-a-Judge - built human-annotated benchmarks for image-instruction pairs spanning image captioning, math reasoning, text reading, and infographics understanding. It assessed MLLM judgment's alignment with human annotators across three settings: scoring evaluation, pairwise comparison, and batch ranking. It showed that while MLLMs are closer to human judgment on pairwise comparison, there are still significant gaps in scoring and batch ranking. Furthermore, MLLM-as-a-Judge exhibits various biases (position bias, length bias, and self-preference), hallucinations, and inconsistencies.

JudgeAnything (Pu et al. 2025) pairs two benchmarks: TaskAnything spans 15 any-to-any modality categories - for both generation and understanding - and JudgeAnything evaluates judging over those tasks. The paper showed that while MLLM-as-a-Judge is promising for understanding (best performance is still with pairwise comparison), there are still significant challenges for generation. It also outlined three different judgment settings (how judges are elicited):

Overall: direct judging, where the judge directly provides reasoning and a final judgment
Rubric: the judge is required to judge based on fine-grained rubrics before making a final judgment
Checklist: the judge first evaluates against detailed checklists - curated through a human-in-the-loop process - before making a final judgment

Across these settings, the paper showed that MLLM-as-a-Judge can be enhanced by well-constructed rubrics and checklists.

A key characteristic of judging in multimodal understanding is that the judge is typically evaluating a text response conditioned on multimodal input. This has major implications - either the judge needs to be truly multimodal and rely on its own perception capabilities, or the judge is just grading text against a (possibly flawed) video description or audio transcription, i.e., a purely text-based LLM-as-a-Judge setup. The above two studies focused on the former, whereas the study below compared the two options.

VideoJudge (Waheed et al. 2025) introduced 3B/7B MLLM judges specialized to evaluate text responses conditioned on videos, with two notable findings:

Small specialized judges can match or surpass much larger general-purpose judges; furthermore, VideoJudge generates test-time rubrics for fine-grained, interpretable scoring.
Genuinely multimodal judges can outperform text-only LLMs that only see the text description, and long chain-of-thought reasoning is not a viable mitigation for the video perception gap.

The process to bootstrap MLLM-as-a-Judge training data in VideoJudge works as follows:

Start with seed data - human-provided gold responses - from three large-scale video instruction–response datasets (VideoInstruct-100K, VCG-Plus-112K, VideoChat2-IT); for multi-turn dialogues, only the first human–assistant exchange is used.
A (data) generator model produces (N−1) candidate responses, where N is the rating scale.
A (data) evaluator model rates each candidate response and provides the corresponding reasoning.
Compute the deviation between the rating the generator was targeting and the rating the evaluator assigned - for candidates with a large deviation, the generator is prompted again with the evaluator’s feedback to improve its response.
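
The loop above can be sketched roughly as follows. This is my own simplified reading, not the paper's code: I assume the gold response takes the top rating N, so the generator is asked for one candidate per target rating 1..N-1, and the callable signatures, max_deviation, and max_retries are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
# generator(instruction, target_rating, feedback) -> response
Generator = Callable[[str, int, Optional[str]], str]
# evaluator(instruction, response) -> (rating, reasoning)
Evaluator = Callable[[str, str], Tuple[int, str]]


def bootstrap_candidates(
    instruction: str,
    generator: Generator,
    evaluator: Evaluator,
    scale: int = 5,
    max_deviation: int = 1,
    max_retries: int = 2,
) -> List[Tuple[str, int, str]]:
    """Return (response, evaluator_rating, reasoning) for N-1 candidates."""
    results = []
    for target in range(1, scale):  # target ratings 1 .. N-1
        feedback = None
        for _ in range(max_retries + 1):
            response = generator(instruction, target, feedback)
            rating, reasoning = evaluator(instruction, response)
            if abs(rating - target) <= max_deviation:
                break
            # Large deviation: re-prompt with the evaluator's critique.
            feedback = reasoning
        # If retries run out, the last attempt is kept with the evaluator's
        # rating; discarding it instead would be an equally valid choice.
        results.append((response, rating, reasoning))
    return results
```

The output triples are what a judge would then be trained on: a response, the rating it deserves, and the reasoning behind that rating.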

Evaluation and Calibration of Multimodal LLM-as-a-Judge

(Multimodal) LLM judges are not ground truth - they are also models requiring evaluation, calibration and quality control.

Evaluation of MLLM-as-a-Judge is anchored on agreement with human judgment, since the main motivation for a judge is to scale beyond what human annotation can cover. As showcased in the studies reviewed above, standard practices are:

Curate a human/expert-annotated golden dataset.
- Not all human datasets are golden - it is critical to make sure humans agree on these annotations first. In addition, it is important to distinguish "low agreement from poor guidelines" (something we should fix) from "low agreement from intrinsic subjectivity" (a hard task-specific ceiling that indicates the task is worth breaking down further).
Measure agreement with human judgments. Similar to the point above, if the agreement rate is low - especially for clear-cut cases - the MLLM-as-a-Judge needs further iteration.
Evaluate the reasoning as well - rubric-level evaluation is crucial especially for production scenarios where the rationales for the final output also matter, not just the output itself.
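
For the agreement step, raw percent agreement is easy to read but can look inflated when one label dominates, so it is common to report a chance-corrected statistic such as Cohen's kappa alongside it. Below is a small self-contained sketch; the label lists are made-up pairwise verdicts, and returning 1.0 in the degenerate single-label case is a convention I chose for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(
    human: Sequence[Hashable], judge: Sequence[Hashable]
) -> float:
    """Fraction of items where the judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(
    human: Sequence[Hashable], judge: Sequence[Hashable]
) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(human)
    observed = percent_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled independently at their
    # own marginal label frequencies.
    expected = sum(
        human_counts[k] * judge_counts[k] for k in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up pairwise verdicts ("A" or "B" preferred) on four items.
human_labels = ["A", "A", "B", "B"]
judge_labels = ["A", "A", "B", "A"]
raw = percent_agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

In this toy case the raw agreement is 0.75 but kappa is 0.5, because half of the agreement would be expected by chance given the label frequencies. The same functions can be run between two human annotators first, to check that the golden set is actually golden.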

Calibration and quality control of MLLM-as-a-Judge:

Confidence-based escalation: estimate the confidence of the judge and escalate to human evaluation when low, in order to guarantee a certain level of human agreement. Similarly, to further optimize scalability, smaller, faster judges can be deployed at scale and escalate to stronger, more time-consuming ones only when confidence is low.
Juries instead of judges: mix judges across different model families and use disagreement among them as a signal to flag low confidence / high ambiguity for human review. However, a jury of judges that have the same limitations (e.g., poor perception capabilities) will agree confidently and wrongly - instead of expanding the jury, we should prioritize fixing these limitations or relying on human-in-the-loop escalations.
Debiasing: randomize candidate ordering to counter position bias; intentionally control for length bias; systematically audit for self-preference.
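
Two of these ideas compose naturally, and the sketch below shows one way to combine them under my own assumptions. Instead of randomizing order, it uses a stricter variant: each judge is asked in both orders and its vote only counts if it survives the swap. The jury then escalates (returns None) unless enough members agree. The PairwiseJudge signature, the "first"/"second" return values, and the min_agreement threshold are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical interface: judge(first, second) returns "first" or "second".
PairwiseJudge = Callable[[str, str], str]


def swap_consistent_vote(
    judge: PairwiseJudge, a: str, b: str
) -> Optional[str]:
    """Ask in both orders; keep the verdict only if it survives the swap."""
    forward = judge(a, b)
    backward = judge(b, a)
    winner_fwd = "a" if forward == "first" else "b"
    winner_bwd = "b" if backward == "first" else "a"
    return winner_fwd if winner_fwd == winner_bwd else None


def jury_verdict(
    judges: Sequence[PairwiseJudge],
    a: str,
    b: str,
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Return "a" or "b", or None to flag the item for human review."""
    votes = [swap_consistent_vote(j, a, b) for j in judges]
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    winner, count = Counter(valid).most_common(1)[0]
    # Abstentions (order-inconsistent judges) count against agreement.
    return winner if count / len(votes) >= min_agreement else None


# Toy judges for demonstration only.
def length_judge(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"


def position_biased_judge(x: str, y: str) -> str:
    return "first"  # always prefers whatever is shown first
```

Note that this only catches disagreement. As pointed out above, a jury whose members share the same blind spot will pass this check confidently and wrongly, so it complements rather than replaces human-in-the-loop escalation.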

It is critical to point out that optimizing generators against under-evaluated or under-calibrated judges is extremely harmful: going back to the multifaceted value of MLLM judges, using a poor judge for RL or training curation will greatly increase the risk of reward hacking or persistent model blind spots. For all these reasons, calibration and bias auditing must become prerequisites, not afterthoughts.

Limitations of Multimodal LLM-as-a-Judge

Reliability is task-dependent: There is no such thing yet as a “universally reliable” Multimodal LLM-as-a-Judge. Evaluation is intrinsically application-dependent - a feature, not a bug. Reliability of the judge for one task is never guaranteed to transfer to another.

Long-form video remains hard: VideoJudge showed that baseline MLLM judges drop substantially on LongVideoBench, even as their trained judges held up - underscoring that long-video judging remains hard without specialized adaptation.

Perceptual bias persists: A recent work, Perceptual Judgment Bias (Park et al. 2026), showed that MLLM judges tend to “reward plausible narratives over perceptually correct answers”, biasing toward text over visual evidence. This paper proposed a new dataset and a training framework to improve perceptual fidelity, pointing to opportunities for strengthening MLLM judges.

Taken together, these findings show that the asymmetry between generation and verification articulated earlier is likely larger for some tasks (factual consistency, rubric-checking) than others (long-context temporal grounding, fine-grained perception).