OpenAI and Paradigm Launch EVMbench to Test AI in Smart Contract Security

OpenAI and Paradigm have introduced EVMbench, a new benchmark that tests whether AI agents can find, fix, and exploit smart contract bugs in controlled blockchain environments.

OpenAI and crypto investment firm Paradigm have announced EVMbench, a testing framework designed to measure how well AI agents handle smart contract security.

For readers outside the developer world, the point is straightforward: smart contracts now secure more than $100 billion in open crypto systems, and even one serious coding mistake can lead to major losses.

What EVMbench does

EVMbench tests AI systems in three modes:

  • Detect: can the AI find vulnerabilities in contract code?
  • Patch: can it fix the issue without breaking the product?
  • Exploit: can it reproduce a fund-draining attack in a sandboxed setting?

According to OpenAI, the benchmark uses 120 curated vulnerabilities from 40 audits, mostly from open audit competitions, plus additional payment-focused scenarios connected to Tempo.
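To make the three modes concrete, here is a minimal, hypothetical sketch in Python of how a harness could score an agent on a single audited vulnerability. The names (VulnCase, sandbox, score_detect, and so on) are illustrative assumptions for this article and do not reflect EVMbench's actual code or interface.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch only -- not EVMbench's real API.
# It illustrates the general shape of grading an AI agent in the three
# modes (detect, patch, exploit) against a known, audited vulnerability.

@dataclass
class VulnCase:
    name: str
    contract_source: str                 # vulnerable contract under test
    ground_truth_finding: str            # bug description from the original audit
    reference_attack: str                # attack known to drain funds before the fix
    sandbox: Callable[[str, str], bool]  # (contract, attack) -> True if funds were
                                         # drained on an isolated local chain

def score_detect(case: VulnCase, agent_report: str) -> float:
    """Detect: did the agent's report identify the audited bug?"""
    return 1.0 if case.ground_truth_finding.lower() in agent_report.lower() else 0.0

def score_patch(case: VulnCase, patched_source: str) -> float:
    """Patch: the known attack must no longer work against the patched contract."""
    return 0.0 if case.sandbox(patched_source, case.reference_attack) else 1.0

def score_exploit(case: VulnCase, agent_attack: str) -> float:
    """Exploit: the agent-built attack must drain funds from the unpatched contract."""
    return 1.0 if case.sandbox(case.contract_source, agent_attack) else 0.0
```

In practice, patch grading would also have to confirm the contract still behaves correctly for legitimate users (the "without breaking the product" requirement above), which this sketch omits.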

Why this matters now

The release comes amid renewed concern around smart contract failures.

Recent incidents linked to smart contract flaws have reinforced a basic reality: AI can help defenders move faster, but attackers can also use the same tools.

In that context, EVMbench is less about headlines and more about measurement. It gives the industry a repeatable way to track whether AI models are improving at offensive and defensive security work.

The headline result

OpenAI says GPT-5.3-Codex scored 72.2% in exploit mode, compared with 31.9% for GPT-5 around six months earlier.

That gain suggests model capability in this area is moving quickly. But OpenAI also says detect and patch performance still falls short of full coverage, which means human review remains essential.

The bigger picture

This story is not simply “AI is good” or “AI is dangerous.” It is both:

  • AI can improve audits and strengthen deployed contracts.
  • The same capability can lower the barrier for exploitation if misused.

For teams building in blockchain, the practical takeaway is clear: AI-assisted security should be adopted, but with strict controls, testing discipline, and independent audits.

Read OpenAI's EVMbench announcement

Read Paradigm's EVMbench post

Try the EVMbench interface