The Death Spiral of Eval Startups (2025)

By Liao | Published May 8th, 2025

Why is the landscape so devoid of independent evaluation (eval) startups?

Whenever a fresh AI trend emerges—be it autonomous agents, voice synthesis, or voice-enabled agents—a predictable pattern occurs. A group of entrepreneurs becomes convinced that the real goldmine isn't in building the models, but in identifying which models are the best and selling that insight to other developers. In essence, they try to commoditize the "eval."

I have witnessed this cycle throughout every single wave of generative AI, even before the term "generative AI" became a household phrase.

🚩 The Three Primary Hurdles

The failure of these ventures generally stems from three systemic pressures:

Talent Attrition: The best people leave for higher-leverage roles.
Market Mismatch: The target customer base is virtually non-existent.
Optimization Pressure: The "big labs" render benchmarks obsolete through gaming.

1. The Talent Drain: Evals vs. Post-Training

The skill set required to build a world-class evaluation framework is nearly identical to the skills needed for post-training and application development. However, the latter two capture significantly more value.

The Data Bottleneck

Creating a high-quality eval requires a rigorous data pipeline—either via synthetic generation or human-in-the-loop feedback. This is the exact same bottleneck faced during post-training.

Mathematically, if we assume the value per datapoint is a constant $v$, the total value generated is: $\text{Total Value} = v \times \text{Quantity of Data}$

Since the volume of data required for post-training is orders of magnitude larger than that required for a benchmark, the value ceiling for eval-focused work is inherently lower.
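As a rough worked version of this argument, write $N_{\text{eval}}$ and $N_{\text{train}}$ for the number of datapoints in a benchmark and in a post-training set, and $V$ for total value. These symbols are introduced here for illustration only. With the same constant value $v$ per datapoint:

$$
\frac{V_{\text{train}}}{V_{\text{eval}}} = \frac{v \, N_{\text{train}}}{v \, N_{\text{eval}}} = \frac{N_{\text{train}}}{N_{\text{eval}}}
$$

With purely hypothetical sizes of $N_{\text{eval}} = 10^3$ and $N_{\text{train}} = 10^6$, the ratio is $10^3$. Under this simple model the value gap equals the data-volume gap, whatever $v$ happens to be.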

Value Comparison Table

| Feature | Eval Startups | Post-Training/App Dev |
| --- | --- | --- |
| Primary Asset | Small, high-quality test sets | Massive, high-quality training sets |
| Financial Upside | Capped by contract size | Potential for $\text{millions} \rightarrow \text{billions}$ |
| Influence | Observational/Advisory | Direct impact on model behavior |
| Talent Appeal | Low (Opportunity cost is too high) | High |

Case in Point: Three researchers recently departed Epoch AI, where they were evaluating agents, to launch a startup focused on post-training tools for agents. They recognized that the leverage had shifted from measuring the model to improving it.

2. The "Ghost" Customer Base

Even if a startup manages to keep its talent, it struggles to find a viable market. The intersection of people who are building on model APIs and people who are unable to evaluate models is almost empty.

The Technical Divide

There are essentially two types of potential customers:

The Technical Developer: If a developer understands the nuance of a $10\%$ improvement on AIME 2024 (computed via best-of-$N$ without tool use), they are already capable of running the eval themselves.
The Non-Technical Executive: If a client doesn't know the difference between GPT-4o and GPT-4.1, they aren't looking for an Elo rating or a technical breakdown; they want a finished product.
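To make the "capable of running the eval themselves" point concrete, here is a minimal sketch of a best-of-N scorer. It assumes one particular reading of best-of-N: a problem counts as solved if any of N sampled answers exactly matches the reference. Other definitions exist, such as picking one answer by majority vote or by a reward model. The names `best_of_n_accuracy`, `sample_answer`, and `make_stub_model` are hypothetical, and the stub stands in for a real model API call.

```python
import itertools
from typing import Callable, Sequence


def best_of_n_accuracy(
    problems: Sequence[tuple[str, str]],
    sample_answer: Callable[[str], str],
    n: int,
) -> float:
    """Fraction of problems where at least one of n samples matches the reference."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not problems:
        raise ValueError("problems must be non-empty")
    solved = 0
    for question, reference in problems:
        # any() stops sampling as soon as one answer matches (assumed scoring rule).
        if any(sample_answer(question).strip() == reference.strip() for _ in range(n)):
            solved += 1
    return solved / len(problems)


def make_stub_model(answers: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: cycles through fixed answers."""
    cycle = itertools.cycle(answers)
    return lambda question: next(cycle)


if __name__ == "__main__":
    toy_problems = [("q1", "42"), ("q2", "7")]  # made-up items, not AIME data
    print(best_of_n_accuracy(toy_problems, make_stub_model(["0", "42", "7"]), n=3))
```

Swapping the stub for a real API client and the toy items for a real problem set is most of the remaining work, which is the sense in which a technical buyer can do this in-house.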

~~Eval startups try to sell technical features to people who want turnkey solutions.~~

The "Gartner" Problem

Market research firms like Gartner produce charts for corporate executives. These charts often feature:

X-axes that are purely fantastical.
Y-axes that are entirely fictional.

These "Magic Quadrants" are designed for people with the technical depth of a toddler. While Gartner can simplify things for executives signing cloud contracts, eval startups typically target developers—who are far too smart to pay for a service they can script in a weekend.

3. The "Big Lab" Gaming Machine

If a startup survives the talent and customer crises, it hits the final wall: The Big Labs. Labs are incentivized to climb public leaderboards at any cost, often using "unfair" advantages.

Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.

Tactics of the Labs:

Data Contamination: Training directly on test data (e.g., Meta's approach with Llama 1 and rumored tactics for Llama 4).
Internal Manipulation: Encouraging employees to vote for their own models on public boards.
The "Bribe": Offering free compute in exchange for favorable results.
Poaching: Hiring away the very people who designed the evals.
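Of the tactics above, data contamination is the one an outside evaluator can at least try to detect. The sketch below shows one common heuristic, word-level n-gram overlap between test items and a training corpus. It is an illustration under stated assumptions, not how any particular lab or benchmark checks for leakage. The function names and the default window of 8 tokens are hypothetical choices, and real checks usually need proper tokenization and access to the training data, which outsiders rarely have.

```python
from typing import Iterable, Sequence


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased whitespace-token n-grams; empty if the text has fewer than n tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    test_items: Sequence[str],
    training_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Share of test items that share at least one n-gram with the training docs."""
    if not test_items:
        raise ValueError("test_items must be non-empty")
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams also appears in training text.
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_ngrams)
    return flagged / len(test_items)


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]  # toy corpus
    tests = ["Quick brown fox jumps high", "an unrelated question"]
    print(contamination_rate(tests, train, n=3))
```

A high rate suggests leakage, but a low rate proves little, since paraphrased or translated test items slip past exact n-gram matching.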

The Llama 4 / Chatbot Arena Scandal

For years, researchers wondered why every new release magically sat atop the LMSYS Chatbot Arena. A recent Cohere report suggests systematic gaming:

Meta allegedly tested 27 different variants of Llama 4 before picking the one that scored highest.
The Llama 4 Maverick model was marketed as beating GPT-4.5, but this was achieved using a version optimized specifically for the Arena; the actual released version performed poorly.
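Testing many private variants and publishing only the top scorer inflates results even when no variant is actually better. The simulation below is a sketch of that selection effect under simplifying assumptions: every variant has the same true score, and each measured score adds independent Gaussian noise. The figure of 27 variants comes from the report cited above; the noise level, trial count, and function name are hypothetical.

```python
import random
import statistics


def simulate_selection_gain(
    num_variants: int,
    noise_sd: float,
    trials: int,
    seed: int = 0,
) -> float:
    """Average amount by which the best measured score exceeds the shared true score."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    true_score = 0.0  # assumption: all variants are equally good
    gains = []
    for _ in range(trials):
        measured = [rng.gauss(true_score, noise_sd) for _ in range(num_variants)]
        # Reporting only the maximum is the "pick the best variant" step.
        gains.append(max(measured) - true_score)
    return statistics.mean(gains)


if __name__ == "__main__":
    single = simulate_selection_gain(num_variants=1, noise_sd=10.0, trials=2000)
    best_of_27 = simulate_selection_gain(num_variants=27, noise_sd=10.0, trials=2000)
    print(f"one submission: {single:+.1f}, best of 27: {best_of_27:+.1f}")
```

With one submission the average gain hovers near zero. With 27, the reported score sits well above the true one purely from picking the luckiest draw, so a leaderboard that allows unlimited private testing rewards volume of attempts as well as quality.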

Example of a "Gamed" Eval Result

{
  "model": "Llama-4-Maverick-Arena-Special",
  "benchmark": "Chatbot Arena",
  "score": "S-Tier",
  "status": "Not for public release",
  "note": "Optimized specifically for human preference patterns in the eval set"
}

🛡️ The Exception: Safety Evals

There is one area where eval startups can actually thrive: Safety Benchmarks.

Why? Because the incentives are different:

Ideology over Money: Safety researchers are often ideologically opposed to "capabilities" work. They won't migrate to post-training for a bigger paycheck because they don't want to make the models more powerful.
Specialized Demand: Technical clients who could replicate safety evals still pay for them because the expertise is highly specialized and the stakes (catastrophic risk) are higher than mere performance gains.