LLM Evaluations Will Continue to Be Underserved

Tags: ai, llms, evals, agents

Originally published June 18, 2025 on Substack. Updated February 20, 2026 with new benchmarks, tooling, and the Gemini 3.1 Pro eval methodology.

What even is an LLM evaluation?

From OpenAI's evaluation documentation:

Evaluations (often called evals) test model outputs to ensure they meet style and content criteria that you specify. Writing evals to understand how your LLM applications are performing against your expectations, especially when upgrading or trying new models, is an essential component to building reliable applications.

This definition is a good technical starting point, but it can readily be expanded. Beyond content and style requirements, you might also want to evaluate things such as:

Task Success and Functional Correctness

Many LLM applications are goal-oriented. Whether it's writing code, summarizing a conversation, generating a response to a customer query, or proposing a supply chain decision, your evaluation criteria should include whether the output actually does the job:

  • Is the output factually accurate?
  • Is it actionable, complete, and relevant to the input?
  • Would a human user consider this output useful?
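To make that concrete, here is a minimal sketch of a task-success check in Python. Everything in it is illustrative: `call_model` is a hypothetical stand-in for your own model or agent client, and the grading is deliberately crude string matching that real evals would usually back up with semantic or model-graded checks.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One task-success case: a prompt plus the facts the output must (not) contain."""
    prompt: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def call_model(prompt: str) -> str:
    # Hypothetical placeholder: swap in your real model/agent client here.
    return "Order #1042 is in transit. Tracking: https://example.com/t/1042"


def grade(case: EvalCase, output: str) -> bool:
    """Functional correctness: required facts present, forbidden claims absent."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


cases = [
    EvalCase(
        prompt="Where is order #1042? It shipped last Tuesday.",
        must_contain=["#1042", "tracking"],
        must_not_contain=["a refund has been issued"],  # the agent must not invent actions
    ),
]

if __name__ == "__main__":
    passed = sum(grade(c, call_model(c.prompt)) for c in cases)
    print(f"task success: {passed}/{len(cases)}")
```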

Guardrails

In practice, LLM evals can serve as:

  • Regression tests, to ensure you haven't broken existing behaviors when tweaking prompts or upgrading models (a minimal sketch follows this list).
  • Model selection tools, helping compare how different LLMs perform under real use cases.
  • Feedback collectors, especially when instrumented in your app to gather user reactions and real-world errors.
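Here is a rough sketch of the regression-test use, assuming a pytest-style workflow. `generate_reply` and the `goldens.json` path are placeholders for your own prompt chain and golden set; the point is simply that a prompt tweak or model upgrade has to re-clear the same bar before it ships.

```python
import json

import pytest


def generate_reply(prompt: str) -> str:
    # Placeholder: call your real prompt template + model here.
    return "Order #1042 is in transit. Tracking: https://example.com/t/1042"


def load_golden_cases(path: str = "goldens.json") -> list[dict]:
    # Each entry: {"prompt": "...", "must_contain": ["..."]}
    with open(path) as f:
        return json.load(f)


@pytest.mark.parametrize("case", load_golden_cases())
def test_no_regression(case: dict) -> None:
    output = generate_reply(case["prompt"]).lower()
    for required in case["must_contain"]:
        assert required.lower() in output, f"regression: missing {required!r}"
```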

Most people don't even deploy agents with evals (and that's a problem)

Let's be blunt: a shocking number of LLM agents get shipped without any real evaluation framework at all. Working in San Francisco, I see a ship-fast mentality across much of the agent startup space. As these companies scale up and try to sell to enterprises, a rude awakening is waiting for them: buyers want confidence that what you're selling actually works. What a concept.

Teams invest weeks tuning prompts, wiring up tools, designing clever planners, and orchestrating multi-step workflows, but then... they hit "deploy" and rely entirely on vibes.

No automated checks. No regression tests. No system for knowing whether today's outputs are better or worse than last week's. If something breaks, they notice because a user complains. Or worse, doesn't.

This isn't just a technical gap; it's an organizational blind spot.

  • LLM behavior is probabilistic, not deterministic. Without evals, you have no baseline and no ability to measure drift (see the sketch after this list).
  • Agentic systems introduce combinatorial complexity. More steps, more failure modes, more surface area. You need structure to manage it.
  • Tooling is evolving fast, and models change under your feet. If you're not evaluating, you're not even in the loop when the ground shifts.
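To put a number on that first point: because outputs are sampled, a single run of a case tells you very little. Here's a rough sketch of the idea, with `run_case` as a hypothetical placeholder for calling and grading your agent on one case.

```python
import math
import random


def run_case(case_id: str) -> bool:
    # Placeholder: call your agent on this case and grade the result.
    # A weighted coin flip stands in here so the sketch runs end to end.
    return random.random() < 0.85


def pass_rate(case_id: str, n: int = 20) -> tuple[float, float]:
    """Run a case n times; return pass rate plus a ~95% half-width (normal approximation)."""
    passes = sum(run_case(case_id) for _ in range(n))
    p = passes / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, half_width


if __name__ == "__main__":
    p, hw = pass_rate("refund_policy_01")
    print(f"pass rate: {p:.2f} ± {hw:.2f}")  # track this over time to see drift
```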

And still, many teams treat evals like a "nice-to-have" or an afterthought. Why? Because they think of them as overhead. But the irony is: not evaluating is way more expensive in the long run. You pay for it in bug hunts, brittle UX, and eroded customer trust.

The best teams don't just ship evals; they design for them from the start. They define success before writing prompts. They build feedback capture into the app experience. And they track eval metrics like product KPIs.

Because at some point, it stops being about the model's capability and starts being about your system's reliability.

Resources on units of evaluation measurement

When you implement an evaluation, you have a goal in mind, but you also need a way to measure it. I won't go into much detail on measurement here, but these resources cover the main approaches:

  • Automated Tests
  • Human-Labeled Goldsets
  • Model-Graded Evals (LLMJ), sketched below
  • Live Feedback (Production)

Going through these should serve either as a primer on what exists today or as a refresher.
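As one example from that list, here is a minimal model-graded (LLM-as-judge) sketch. It assumes the OpenAI Python SDK's chat.completions interface; the grader model name, rubric, and PASS/FAIL protocol are illustrative choices, not recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer.

Question: {question}
Answer: {answer}

Rubric: the answer must be factually consistent, complete, and free of invented details.
Reply with exactly one word: PASS or FAIL."""


def judge(question: str, answer: str, grader_model: str = "gpt-4o") -> bool:
    """Ask a grader model for a binary verdict on one (question, answer) pair."""
    response = client.chat.completions.create(
        model=grader_model,  # placeholder; use whatever grader model you trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")
```

Binary verdicts like this are usually easier to calibrate against a human-labeled goldset than free-form numeric scores, which is one reason many teams start there.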


Update: What's changed since June 2025

Since this article was originally published, the eval landscape has moved fast. Here's what's new:

New benchmarks for agentic systems

The conversation has shifted from "does the model respond well?" to "does the agent behave correctly, consistently, and safely over time?" Several new benchmarks reflect this:

  • FieldWorkArena, ECHO & Enterprise RAG Benchmark: researchers at Carnegie Mellon and Fujitsu introduced benchmarks measuring whether AI agents are safe enough to run business operations without human oversight, presented at AAAI 2026 in Singapore. FieldWorkArena evaluates agents in logistics and manufacturing environments, while ECHO targets hallucination mitigation in vision-language models.
  • Microsoft's Multimodal Agent Score (MAS): a unified, absolute measure of end-to-end conversational quality for agents operating across modalities.
  • AgentX–AgentBeats Competition (UC Berkeley): a two-phase competition with $1M+ in prizes challenging teams to first build novel agent benchmarks, then build agents to excel on them.

Established benchmarks keep climbing

  • GAIA: at the highest difficulty (Level 3), Writer's Action Agent hit 61% in mid-2025, surpassing Manus AI (~57.7%) and OpenAI's Deep Research (~47.6%). Human experts still reliably hit 90%+.
  • SWE-bench: Claude Opus leads at 80.8%, with Gemini 3.1 Pro close behind at 80.6%. The coding agent eval race is tight.
  • τ-Bench / τ²-Bench: multi-turn interaction simulations across retail support and airline booking continue to expose agent weaknesses in sustained dialogue.
  • ARC-AGI-2: Gemini 3.1 Pro scored 77.1%, more than doubling Gemini 3 Pro's 31.1% and pulling well ahead of Claude Opus 4.6 (68.8%) and GPT-5.2 (52.9%). Reasoning benchmarks are moving fast.

The tooling has matured

The "just vibes" era is getting harder to justify. The tooling is catching up:

  • Deepchecks addresses evaluation at the system level: detecting hallucinations, factual inconsistencies, bias, and prompt sensitivity as continuous monitoring rather than one-time validation.
  • LangSmith combines detailed execution tracing with structured review for agent debugging.
  • RAGAS isolates retrieval quality from generation quality in RAG pipelines, answering whether your agent even retrieved the right context before generating. The underlying idea is sketched below.
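That retrieval/generation split is worth internalizing even if you never adopt RAGAS itself. Here's a rough sketch of the underlying idea (not the RAGAS API); `retrieve` and `generate` are hypothetical stand-ins for your own pipeline, and the faithfulness check is a crude lexical proxy you'd normally replace with a model-graded one.

```python
def retrieve(question: str) -> list[str]:
    # Placeholder: your retriever (vector store, BM25, hybrid, ...).
    raise NotImplementedError


def generate(question: str, contexts: list[str]) -> str:
    # Placeholder: your answer-generation call, conditioned on the contexts.
    raise NotImplementedError


def retrieval_hit(contexts: list[str], gold_fact: str) -> bool:
    """Did any retrieved chunk actually contain the fact the answer needs?"""
    return any(gold_fact.lower() in c.lower() for c in contexts)


def grounded(answer: str, contexts: list[str]) -> bool:
    """Crude faithfulness proxy: every sentence shares at least one word with the contexts."""
    ctx = " ".join(contexts).lower()
    sentences = [s for s in answer.split(".") if s.strip()]
    return all(any(word in ctx for word in s.lower().split()) for s in sentences)


def eval_rag_case(question: str, gold_fact: str) -> dict:
    contexts = retrieve(question)
    answer = generate(question, contexts)
    return {
        "retrieval_ok": retrieval_hit(contexts, gold_fact),  # did we fetch the right context?
        "generation_ok": grounded(answer, contexts),          # did the answer stick to it?
    }
```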

How frontier labs publish evals: Gemini 3.1 Pro as a case study

One thing worth calling out: Google's Gemini 3.1 Pro eval methodology page is genuinely useful as a reference for how to think about evaluation rigor. It details:

  • Pass@1 methodology: all scores are single-attempt with default sampling settings, no majority voting or parallel test-time compute. This is the honest way to report. (The standard pass@k estimator is sketched after this list.)
  • Cross-model comparison methodology: they source non-Gemini results from providers' self-reported numbers and default to maximum thinking/reasoning settings for competitors. Transparent about what's apples-to-apples and what isn't.
  • Benchmark-by-benchmark breakdowns across reasoning (GPQA Diamond, Humanity's Last Exam, ARC-AGI-2), coding (LiveCodeBench Pro, SWE-bench, Terminal-Bench 2.0), and agentic tasks (MCP Atlas, BrowseComp, APEX-Agents).
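For reference, the standard unbiased pass@k estimator (introduced in the HumanEval paper) is worth keeping on hand when reading numbers like these; with a single attempt per problem it collapses to the plain fraction of problems solved. The numbers below are illustrative only, not any model's actual results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# (n_samples, n_correct) per problem; illustrative numbers only.
per_problem = [(10, 7), (10, 0), (10, 3)]

pass_at_1 = sum(pass_at_k(n, c, k=1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 ≈ {pass_at_1:.3f}")  # equals the mean of c/n, i.e. fraction solved per attempt
```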

Whether or not you're building on Gemini, the model card and eval methodology page are worth studying as a template for how to document and communicate evaluation results. If frontier labs are investing this much in structured eval reporting, the bar for the rest of us should be at least as high for our own systems.

The trend is clear: evaluation is becoming infrastructure, not an afterthought. The teams that treat it that way will be the ones still standing when the dust settles.