TREX: An AI Code Reviewer That Actually Executes Your Code

By Shlok Mehrotra | June 17, 2026

I am a software engineer at Greptile, and we have recently developed a new approach to pull request reviews. Instead of simply analyzing the text of a diff, we built a system that actually runs the code to demonstrate exactly where and why a failure occurs.

The Evolution of Code Inspection

Back in 1976, Michael Fagan introduced the concept of formal code inspection at IBM. The process was manual and rigorous:

Developers would print out physical code listings, gather in a room, and meticulously read through the logic line by line.

Modern AI tools have accelerated this process, but most are still essentially doing the same thing: reading. Static analysis works well for bugs that are visible in the source text, but it hits a ceiling when dealing with "runtime-only" issues.

The "Invisible" Bug Category

Certain defects cannot be seen by reading the source code alone; they only emerge during execution. These include:

Logic Errors: Bugs requiring a specific sequence of state transitions.
UI Regressions: Visual glitches that only appear after the DOM has loaded.
Race Conditions: Timing issues that require a live network request to trigger.

Even a perfect reading of the diff is not enough to catch these. This is why we created TREX (Test, Run, Execute), an execution layer integrated directly into the review process.
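As a purely illustrative case of the first category (not taken from a real review), consider the toy cart below. The `Cart` class and all of its methods are hypothetical. Each method reads as reasonable on its own in a diff, and the defect only surfaces when the calls run in one particular order.

```python
class Cart:
    """Hypothetical shopping cart used only to illustrate a state-sequence bug."""

    def __init__(self) -> None:
        self.items: list[float] = []
        self.discount = 0.0

    def add(self, price: float) -> None:
        self.items.append(price)

    def apply_coupon(self, amount: float) -> None:
        # Looks harmless in isolation: the coupon amount is simply recorded.
        self.discount += amount

    def remove_last(self) -> None:
        self.items.pop()

    def total(self) -> float:
        # Bug: the total is never clamped at zero, so it can go negative
        # when an item is removed after a coupon has been applied.
        return sum(self.items) - self.discount


def happy_path() -> float:
    cart = Cart()
    cart.add(10.0)
    cart.add(5.0)
    cart.apply_coupon(12.0)
    return cart.total()


def failing_sequence() -> float:
    cart = Cart()
    cart.add(10.0)
    cart.add(5.0)
    cart.apply_coupon(12.0)
    cart.remove_last()  # this specific transition exposes the bug
    return cart.total()
```

Running `happy_path()` gives a sensible total, while `failing_sequence()` returns a negative one. An execution layer that actually drives the sequence observes the bad value directly instead of having to infer it.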

Orchestrating Agents Without Context Bloat

The architecture of TREX evolved through three distinct phases. We initially struggled to balance agent autonomy with shared knowledge.

The Architectural Journey

Version	Design	The Problem	Result
v1	Separate Agents	No shared knowledge; overlapping work.	$\text{Wasted Compute} \uparrow$
v2	Single Giant Agent	Context window overload (logs + screenshots).	$\text{Performance} \downarrow$
v3	Orchestrator Model	Managing sub-agents from a primary agent.	$\text{Efficiency} \uparrow$

The Orchestration Workflow

In the current iteration, TREX is not a standalone product but a capability of the main Greptile reviewer: the primary review agent acts as the orchestrator and delegates execution work to TREX sub-agents.

By using this "agent-within-an-agent" approach, TREX sub-agents inherit the context found by the orchestrator but maintain their own dedicated context windows. For example, if a feature is hidden behind an auth gate, a sub-agent can independently handle the environment setup, authentication, and feature flag toggling to capture a screenshot of the rendered page.
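The orchestrator pattern described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Greptile's implementation: `Orchestrator`, `SubAgent`, and their fields are hypothetical names, and the "work" a sub-agent does is simulated with strings. The point it illustrates is that a sub-agent receives a read-only copy of the orchestrator's findings, keeps its verbose working state private, and hands back only a compact summary.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """Hypothetical TREX sub-agent with its own private context."""

    task: str
    inherited: tuple[str, ...]  # read-only snapshot of orchestrator findings
    context: list[str] = field(default_factory=list)  # private working log

    def run(self) -> str:
        # Verbose output (logs, screenshots) would accumulate here and
        # never be copied into the orchestrator's context.
        self.context.append(f"set up environment for: {self.task}")
        self.context.append(f"executed: {self.task}")
        # Only a short summary is returned to the orchestrator.
        return f"{self.task}: done ({len(self.context)} steps)"


@dataclass
class Orchestrator:
    """Hypothetical primary agent that delegates execution work."""

    findings: list[str] = field(default_factory=list)
    summaries: list[str] = field(default_factory=list)

    def delegate(self, task: str) -> SubAgent:
        agent = SubAgent(task=task, inherited=tuple(self.findings))
        self.summaries.append(agent.run())
        return agent
```

For example, `Orchestrator(findings=["checkout page is behind an auth gate"]).delegate("screenshot checkout page")` leaves the orchestrator holding one summary line, while the two setup and execution steps stay in the sub-agent's own `context`.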

Multi-Modal Artifacts: "Showing the Work"

Initially, TREX reported findings as simple bullet points (e.g., "Tested checkout flow, found failure"). This was insufficient: with nothing to back a claim, there was no way to spot the occasional hallucination, where the agent claimed to have tested something it hadn't.

To solve this, we implemented multi-modal artifact sets. Now, every finding is backed by evidence:

Visuals: Screenshots and videos (essential for animation changes).
Data: API traces and execution logs.
Reproducibility: The actual execution scripts used.
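One way to picture an artifact set is as a record that refuses to count as a finding until evidence is attached. The sketch below is an assumption about shape only: the `Finding` class, its field names, and the rule "at least one piece of evidence plus the script" are hypothetical, chosen to mirror the three categories listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative only."""

    summary: str
    visuals: list[str] = field(default_factory=list)     # screenshot/video paths
    api_traces: list[str] = field(default_factory=list)  # serialized API traces
    logs: list[str] = field(default_factory=list)        # execution log excerpts
    script: str = ""                                      # script that was run

    def missing_evidence(self) -> list[str]:
        """List which kinds of backing are absent from this finding."""
        missing = []
        if not (self.visuals or self.api_traces or self.logs):
            missing.append("evidence")
        if not self.script:
            missing.append("script")
        return missing

    def is_backed(self) -> bool:
        return not self.missing_evidence()
```

Under this rule, a bare `Finding("Tested checkout flow, found failure")` is rejected, which is exactly the kind of unsupported bullet point the artifact sets were introduced to eliminate.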

Why Artifacts Matter

This is analogous to grade school mathematics: you cannot verify if an answer is correct unless you see the steps taken to reach it.

$\text{Proof of Execution} = \text{Logs} + \text{Traces} + \text{Visuals}$

If an agent provides only the answer, the developer doesn't know where to fix the code. With the trace, the exact point of failure is exposed.

Verification Checklist for Reviewers:

Review the execution script for correctness.
Inspect the API trace for unexpected status codes.
Watch the video artifact for UI regressions.
Confirm the logs match the expected state transitions.
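The second checklist item lends itself to a quick mechanical pass before reading the trace by hand. The helper below is a small sketch, assuming the trace is available as a list of dictionaries with a `status` key; that format, the example paths, and the default set of expected codes are all assumptions rather than the actual TREX artifact layout.

```python
from typing import Any


def unexpected_statuses(
    trace: list[dict[str, Any]],
    expected: frozenset[int] = frozenset({200, 201, 204}),
) -> list[dict[str, Any]]:
    """Return the trace entries whose status code is not in `expected`."""
    return [entry for entry in trace if entry["status"] not in expected]


# Hypothetical trace; the endpoints are made up for illustration.
example_trace = [
    {"method": "GET", "path": "/cart", "status": 200},
    {"method": "POST", "path": "/checkout", "status": 500},
]
```

Calling `unexpected_statuses(example_trace)` returns only the `POST /checkout` entry, which points the reviewer at the request worth inspecting first.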

A Model-Agnostic Evaluation Harness

The AI landscape changes rapidly. A model that dominates coding tasks today might be obsolete tomorrow. To avoid being locked into a single provider, we built TREX on a model-agnostic harness.

This allows us to "hot-swap" models without rewriting the core system. We can also mix and match models by role:

Orchestrator Agent: Might use Model A for high-level reasoning.
TREX Sub-Agents: Might use Model B for specialized execution tasks.

{
  "config": {
    "orchestrator": "gpt-4o",
    "execution_agents": "claude-3-5-sonnet",
    "harness_version": "2.1.0"
  }
}

This flexibility ensures that as internal evaluations shift, we can always deploy the most capable model for each specific part of the review pipeline.
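A harness like this usually comes down to coding against a small interface and resolving concrete models from configuration. The sketch below shows that idea under stated assumptions: the `Model` protocol, `EchoModel`, and `build_pipeline` are hypothetical names, and `EchoModel` is a stand-in that makes no provider calls. Only the config keys and model names are taken from the example above.

```python
import json
from typing import Callable, Protocol


class Model(Protocol):
    """Hypothetical minimal interface every model backend must satisfy."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend; a real adapter would call a provider SDK here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def build_pipeline(
    config_text: str, factory: Callable[[str], Model]
) -> dict[str, Model]:
    """Resolve one model per role from a JSON config string."""
    config = json.loads(config_text)["config"]
    return {
        "orchestrator": factory(config["orchestrator"]),
        "execution_agents": factory(config["execution_agents"]),
    }


CONFIG = """
{
  "config": {
    "orchestrator": "gpt-4o",
    "execution_agents": "claude-3-5-sonnet",
    "harness_version": "2.1.0"
  }
}
"""
```

With `pipeline = build_pipeline(CONFIG, EchoModel)`, swapping a model means editing one string in the config; the code that calls `pipeline["orchestrator"].complete(...)` does not change.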