Will It Mythos?

A deep dive into whether the "Mythos" model is a security superpower or just a marketing masterclass.

The Mythos Mystery

There is a certain aura around Mythos. It is touted as a tool capable of uncovering security vulnerabilities that would baffle other systems. Because of this supposed power, it has been ~~cordoned off from the general public~~ restricted to a select few, ostensibly to prevent the world from being overwhelmed by its exploit-finding capabilities.

However, I remain skeptical of the official narrative. My theory? It is likely far more expensive to operate than standard models, and the capacity to scale it is simply not there yet. This leads to the central question: Is the ability to find security bugs actually unique to Mythos, or is this simply more AI hype?

Engineering the Benchmark

To find the answer, I leveraged Nelson, a tool I previously developed to automate bug hunting within my own software. Using Claude as a collaborator, I constructed a specialized benchmark suite.

The Corpus Construction

The goal was to create a "blind" test. I followed these steps:

Identify bugs that the official documentation specifically credits to Mythos.
Locate the last commit before the fix was applied.
Use Opus (version 4.7) to verify that the bug is identifiable if the model is explicitly told where to look.
Perform human spot-checks to ensure accuracy.

Crucial Detail: All 9 bugs in the current corpus were created after the knowledge cutoff for the tested models. This ensures the AI cannot simply recall the bug from its training data.
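
The selection steps and the cutoff rule can be sketched as a small data model. This is a minimal sketch, not the actual Nelson code: the `CorpusEntry` fields and the `after_cutoff` helper are hypothetical names chosen for illustration, and how the creation date of a bug is determined is left open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CorpusEntry:
    """One benchmark case (all field names are illustrative assumptions)."""
    project: str         # repository containing the bug
    target_file: str     # the file handed to the model for review
    pre_fix_commit: str  # last commit before the fix was applied
    created: date        # when the bug was created, per the wording above


def after_cutoff(entries, model_cutoffs):
    """Keep only entries created after every tested model's knowledge cutoff."""
    latest_cutoff = max(model_cutoffs.values())
    return [e for e in entries if e.created > latest_cutoff]
```

Comparing against the latest cutoff among the tested models is the conservative choice: an entry passes only if none of the models could have seen it during training.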

The Workflow

The logic of the benchmark can be visualized as follows:

[Figure: benchmark workflow diagram]

Testing Parameters & Constraints

The models were placed in a controlled environment to simulate a realistic security audit.

The Environment

Infrastructure: Each model ran inside a fresh container.
Access: They received a sanitized full source checkout and the specific file to review.
Restrictions: The .git directory was deleted to prevent the models from analyzing commit history or "peeking" into the future.
Capabilities: They had network access, so they could in principle look up CVEs, though this is unlikely to be the main driver of the results.
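
Stripping version-control metadata from a checkout can be sketched in a few lines. This is a minimal sketch of one way to do it, assuming a plain directory copy is acceptable; the `sanitized_checkout` name is hypothetical, and the container setup itself is not shown.

```python
import shutil
from pathlib import Path


def sanitized_checkout(src: Path, dst: Path) -> Path:
    """Copy a source tree to dst, leaving out Git metadata.

    dst must not exist yet; shutil.copytree creates it.
    """
    # ignore_patterns is applied at every directory level, so nested
    # .git entries (for example inside submodules) are skipped as well.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(".git"))
    return dst
```

Copying into a fresh directory, instead of deleting `.git` in place, leaves the original clone usable for checking out the next pre-fix commit.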

The Methodology

The models were given the target file and a basic set of tools. No hints were provided other than the filename—which is standard practice in auditing. They could traverse the repository to understand logic across different files.
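
To make "a basic set of tools" concrete, here is a minimal sketch of what such a tool set might look like. Everything in it is an assumption for illustration: the `RepoTools` class, its method names, and the choice to confine access to the checkout root are not taken from the actual harness. It needs Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


class RepoTools:
    """Hypothetical read-only tools exposed to a model during an audit."""

    def __init__(self, root):
        self.root = Path(root).resolve()

    def _resolve(self, relative_path):
        # Resolve symlinks and ".." first, then refuse anything outside root.
        path = (self.root / relative_path).resolve()
        if not path.is_relative_to(self.root):
            raise PermissionError(f"{relative_path} is outside the checkout")
        return path

    def read_file(self, relative_path):
        """Return the text of one file in the checkout."""
        return self._resolve(relative_path).read_text(errors="replace")

    def list_dir(self, relative_path="."):
        """Return the sorted entry names of one directory in the checkout."""
        return sorted(p.name for p in self._resolve(relative_path).iterdir())
```

With just these two calls a model can start from the target file and follow a function into its callers and callees elsewhere in the repository.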

The Challenge: Finding a bug often requires understanding the broader context of how a function is used, which is a hard problem for both humans and AI.

Each model is scored by its detection probability, $P_d$:

$P_d = \frac{\text{Bugs Identified}}{\text{Total Bugs in Corpus}}$
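
As a worked example, the ratio is straightforward to compute. This is a minimal sketch; the function name is hypothetical, and the counts in the usage line are made-up numbers for illustration, not results from the benchmark.

```python
def detection_probability(bugs_identified: int, total_bugs: int) -> float:
    """Return the fraction of corpus bugs that a model identified."""
    if total_bugs <= 0:
        raise ValueError("the corpus must contain at least one bug")
    if not 0 <= bugs_identified <= total_bugs:
        raise ValueError("bugs_identified must be between 0 and total_bugs")
    return bugs_identified / total_bugs


# Hypothetical counts: a model that finds 6 of the 9 bugs in the corpus.
print(f"{detection_probability(6, 9):.3f}")  # prints 0.667
```

With only 9 bugs in the corpus, a single additional hit or miss moves the score by about 0.11, so small differences between models should be read with caution.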

The "Agent" Experiment

I initially tested the models using two different configurations: a basic API harness and full-featured agents (either the vendor's preferred agent or Claude Code).

Agent Performance Comparison

| Configuration | Performance | Cost / tokens | Latency |
|---|---|---|---|
| Basic API | Baseline | Low | Fast |
| Full agent | No improvement or slightly worse | Very high | Slow |

Conclusion on Agents: For most models, the agent wrapper added noise and cost without adding value. The only exception I made was for the Claude family, which I ran via Claude Code because the subscription is significantly cheaper than API credits.

The Gemini/Antigravity Failure

I attempted to use agy (the Antigravity CLI for Gemini), but it proved entirely useless for security research. In 8 out of 9 instances, it responded with: "Sorry, I cannot fulfill your request to analyze the specified code file for exploitable security vulnerabilities."

Even when I stripped "trigger" words like exploitable or vulnerable, the model recognized the intent and refused. I eventually bypassed this by paying for direct API access via Google AI Studio, as agy is simply not fit for security work.
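
A harness benefits from telling refusals apart from genuine misses, since a refusal says nothing about whether the model could have found the bug. Below is a minimal sketch of such a check. The only marker it uses is taken from the refusal quoted above; the function names are hypothetical, and a real harness would likely need more markers or a more robust classifier.

```python
# Lowercased opening of the refusal message quoted above.
REFUSAL_MARKERS = ("sorry, i cannot fulfill your request",)


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Return True if the start of the response looks like a refusal."""
    head = response.strip().lower()[:200]
    return any(marker in head for marker in markers)


def refusal_rate(responses) -> float:
    """Return the fraction of responses flagged as refusals."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses to score")
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Only the first 200 characters are inspected so that a long, legitimate report that happens to quote a refusal phrase later on is not miscounted.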

Final Thoughts

While this isn't a "smoking gun," the data is illuminating. The process was grueling, taking several hours over multiple days, though future runs should be faster thanks to newly added concurrency support.

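def evaluate_model(model, corpus):
    """Run the model blind on each bug's target file and collect its findings."""
    results = []
    for bug in corpus:
        # The model sees only the target file, never the known bug description.
        finding = model.analyze(bug.file)
        results.append(finding)
    return results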
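
Since each file is analyzed independently, the loop above parallelizes naturally. Here is a minimal sketch of a concurrent variant, assuming the same `model.analyze(bug.file)` interface and assuming each call mostly waits on a remote API (which is why threads are used). The function name and the `max_workers` default are hypothetical, not the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_model_concurrently(model, corpus, max_workers=4):
    """Analyze every bug's target file in parallel, preserving corpus order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so results[i]
        # always corresponds to corpus[i] regardless of completion order.
        return list(pool.map(lambda bug: model.analyze(bug.file), corpus))
```

If `model.analyze` raises for one bug, the exception propagates when the results are collected, so a production harness would probably wrap each call to record failures without aborting the whole run.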

The fact that some bugs in the corpus are exceptionally difficult to locate suggests that Mythos might indeed possess a specialized edge. However, the rankings show that other top-tier models are competitive. Notably, GPT 5.5 Pro currently sits at the top of the leaderboard.