EVMbench: OpenAI and Paradigm Benchmark AI Agents for Smart Contract Security

Introduction

Smart contracts routinely secure over $100 billion in open-source crypto assets. With AI agents becoming increasingly capable of reading, writing, and executing code, measuring their ability to operate in economically meaningful environments has never been more critical. Enter EVMbench, a new benchmark developed by OpenAI in collaboration with Paradigm that evaluates AI agents’ capabilities in detecting, patching, and exploiting high-severity smart contract vulnerabilities.

This article explores what EVMbench is, how it works, what the results mean for developers, and why it represents a significant milestone in the intersection of AI and blockchain security.

Why Smart Contract Security Matters

The financial stakes in smart contract security are enormous. Billions of dollars in assets flow through DeFi protocols, and a single vulnerability can result in catastrophic losses. Traditional security audits, while essential, cannot scale to meet the pace of development across dozens of chains and thousands of protocols.

AI agents offer a potential solution—but how do we measure their effectiveness? That’s the question EVMbench aims to answer.

The Financial Stakes

Metric                                    Value
Total value secured by smart contracts    $100B+
Average DeFi hack (2024)                  $50M+
Code audit backlog                        Months of waiting

What is EVMbench?

EVMbench is an open evaluation framework designed to measure how well AI agents can:

  1. Detect vulnerabilities in smart contract codebases
  2. Patch vulnerable contracts while preserving functionality
  3. Exploit vulnerabilities in sandboxed environments

What Makes EVMbench Different?

Unlike synthetic benchmarks, EVMbench uses real vulnerabilities sourced from:

  • Code4rena auditing competitions (majority)
  • Tempo blockchain audit scenarios
  • Historical smart contract exploits
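A classic pattern behind many of the historical exploits such benchmarks draw on is reentrancy. The following is a minimal Python simulation of that pattern (an illustrative sketch, not Solidity and not an EVMbench task): the vault makes an external call before updating its books, so the callback can withdraw again while the old balance is still recorded.

```python
# Minimal Python simulation of a reentrancy bug: the external call happens
# BEFORE the balance is zeroed, so a malicious callback can re-enter withdraw.

class VulnerableVault:
    def __init__(self):
        self.balances = {}
        self.total = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who, callback):
        amount = self.balances.get(who, 0)
        if amount > 0:
            callback(amount)            # external call before state update
            self.balances[who] = 0      # too late: callback already re-entered
            self.total -= amount

vault = VulnerableVault()
vault.deposit("attacker", 10)
vault.deposit("victim", 90)

drained = []
def reenter(amount):
    drained.append(amount)
    if len(drained) < 3:                # re-enter while balance is still recorded
        vault.withdraw("attacker", reenter)

vault.withdraw("attacker", reenter)
print(sum(drained))  # attacker pulls 30 from a 10-token deposit
```

The fix is the checks-effects-interactions ordering: zero the balance first, then make the external call.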

By the Numbers

Metric                 Value
Total vulnerabilities  120
Source audits          40
Evaluation modes       3
Chains supported       EVM-compatible

The Three Evaluation Modes

1. Detect Mode

Agents audit a smart contract repository and are scored on their ability to recall ground-truth vulnerabilities and associated audit rewards.

How It Works

  • Agent receives complete codebase
  • Agent identifies security issues
  • Score based on recall of known vulnerabilities
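The recall scoring described above can be sketched in a few lines. This is a hedged reconstruction, not the benchmark's actual grader: the vulnerability identifiers and reward values are hypothetical, and the assumption is that each ground-truth finding is weighted by its associated audit reward.

```python
# Sketch of detect-mode scoring: reward-weighted recall over ground-truth
# findings. Vulnerability ids and reward amounts below are hypothetical.

def detect_score(ground_truth: dict, agent_findings: set) -> float:
    """ground_truth maps vulnerability id -> audit reward (weight)."""
    total = sum(ground_truth.values())
    recalled = sum(r for vid, r in ground_truth.items() if vid in agent_findings)
    return recalled / total if total else 0.0

truth = {
    "reentrancy-withdraw": 50_000,
    "oracle-staleness":    20_000,
    "access-control":      30_000,
}
found = {"reentrancy-withdraw", "access-control"}
print(detect_score(truth, found))  # 0.8
```

Under this metric, stopping after one finding caps the score hard, which is exactly the failure mode the benchmark surfaces.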

Key Insight

“Agents sometimes stop after identifying a single issue rather than exhaustively auditing the codebase.”

2. Patch Mode

Agents modify vulnerable contracts to eliminate exploitability while preserving intended functionality.

How It Works

  • Agent receives vulnerable contract
  • Agent implements fix
  • Automated tests verify: (a) functionality preserved, (b) vulnerability eliminated
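The two-sided verification above can be sketched as a single gate: a patch passes only if the functional test suite still passes and the known exploit no longer succeeds. The callables here stand in for the benchmark's real test harness; the names are hypothetical.

```python
# Sketch of the patch-mode gate: functionality preserved AND exploit blocked.
# `functional_tests` and `exploit` are stand-ins for the real harness.

def grade_patch(functional_tests, exploit) -> bool:
    functionality_preserved = all(t() for t in functional_tests)
    vulnerability_eliminated = not exploit()
    return functionality_preserved and vulnerability_eliminated

# Toy stand-ins: a patch that keeps behavior and blocks the drain attempt.
tests = [lambda: True, lambda: True]        # e.g. deposits/withdrawals still work
exploit_attempt = lambda: False             # exploit no longer succeeds
print(grade_patch(tests, exploit_attempt))  # True

# An over-patched contract fails the gate even though the exploit is blocked.
broken_tests = [lambda: True, lambda: False]
print(grade_patch(broken_tests, exploit_attempt))  # False
```

The second case is the "over-patching" failure mode the results section highlights: removing the vulnerability is not enough if legitimate behavior breaks.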

Key Insight

“Maintaining full functionality while removing subtle vulnerabilities remains challenging.”

3. Exploit Mode

Agents execute end-to-end fund-draining attacks against deployed contracts in a sandboxed blockchain environment.

How It Works

  • Contract deployed to local Anvil instance
  • Agent writes and executes exploit code
  • Grading performed via transaction replay and on-chain verification
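The replay-and-verify grading above can be sketched as follows. This is a simulation with in-memory bookkeeping, whereas the real harness replays transactions against an Anvil node and checks on-chain state; the TVL figures are hypothetical.

```python
# Sketch of exploit-mode grading: replay the agent's transactions in order
# against a fresh deployment, then check whether the contract was drained.
# State is simulated here; the real harness queries the chain.

def replay_and_grade(initial_tvl: int, transactions, state: dict) -> bool:
    state["tvl"] = initial_tvl
    for tx in transactions:        # sequential replay, as in the benchmark
        tx(state)
    return state["tvl"] == 0       # success iff funds are fully drained

txs = [
    lambda s: s.update(tvl=s["tvl"] - 600_000),   # exploit transaction 1
    lambda s: s.update(tvl=s["tvl"] - 400_000),   # exploit transaction 2
]
print(replay_and_grade(1_000_000, txs, {}))  # True: contract fully drained
```

The binary outcome is what makes this mode such a clean training signal: the grader does not judge the exploit's elegance, only the final balance.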

Key Insight

“The objective is explicit: continue iterating until funds are drained.”

Performance Results

Current Model Rankings

Model          Exploit Mode  Release Date
GPT-5.3-Codex  72.2%         2026
GPT-5          31.9%         Mid-2025

Performance Analysis

Exploit Mode: Strongest Performance

AI agents excel when the objective is clear and measurable. Exploit mode provides explicit feedback—either funds are drained or they aren’t—making it easier for agents to iterate toward success.

Detect Mode: Room for Improvement

Agents struggle with exhaustive auditing. Common issues:

  • Stopping after finding one vulnerability
  • Missing edge cases in complex contracts
  • Overlooking low-severity issues that compound

Patch Mode: The Hardest Challenge

Balancing security fixes with functional preservation is difficult. Agents often:

  • Introduce breaking changes
  • Over-patch, breaking legitimate functionality
  • Miss subtle logic vulnerabilities

Technical Implementation

Evaluation Framework

OpenAI developed a Rust-based harness that:

  • Deploys contracts deterministically
  • Replays agent transactions
  • Restricts unsafe RPC methods
  • Provides reproducible results
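One of the safeguards listed above, restricting unsafe RPC methods, can be sketched as a simple prefix filter. The idea is to block node cheat codes (the `anvil_*`, `evm_*`, and `hardhat_*` namespaces used by local development nodes to mint balances or warp time) so agents cannot "win" by editing chain state directly. The exact blocklist is an assumption, not the harness's actual policy.

```python
# Sketch of an RPC-method filter: block node cheat-code namespaces so an
# agent must drain funds through real transactions, not by editing state.
# The prefix list is an assumption about the harness's policy.

BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_")

def rpc_allowed(method: str) -> bool:
    """Return True if the RPC call may be forwarded to the node."""
    return not method.startswith(BLOCKED_PREFIXES)

print(rpc_allowed("eth_sendRawTransaction"))  # True: normal transaction
print(rpc_allowed("anvil_setBalance"))        # False: balance cheat code
print(rpc_allowed("evm_increaseTime"))        # False: time-warp cheat code
```

An allowlist of known-safe `eth_*` methods would be stricter than this blocklist; either way, the point is that the sandbox only counts exploits achievable on a real chain.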

Sandbox Environment

// Exploit mode runs in an isolated environment
Environment:
  - Local Anvil instance (not mainnet)
  - Historical vulnerabilities only
  - Single-chain support
  - Mock contracts for some scenarios

Task Creation Process

  1. Adapt existing PoC exploit tests
  2. Manually write scenarios where no PoC exists
  3. Ensure vulnerabilities are exploitable
  4. Verify patches don’t break compilation
  5. Red-team environments to prevent cheating

Limitations and Scope

What’s Included

  • 120 curated vulnerabilities from real audits
  • Historical, publicly documented issues
  • Payment-oriented smart contract scenarios (Tempo)
  • Automated grading infrastructure

What’s Excluded

  • Mainnet forks (clean Anvil instance only)
  • Multi-chain environments
  • Timing-dependent behaviors
  • Zero-day vulnerabilities
  • Human-discovered issues not in competitions

Grading Limitations

  • Detect mode: Only known vulnerabilities are scored; additional issues found by agents go unverified
  • Exploit mode: Sequential transaction replay; no parallel execution scenarios
  • Patch mode: May miss subtle functional regressions

Why This Matters for Developers

AI-Assisted Auditing is Coming

EVMbench signals that AI agents will increasingly participate in smart contract security. For developers, this means:

Opportunities

  • Faster initial vulnerability discovery
  • Automated regression testing
  • Second-pass auditing before human review
  • Continuous security monitoring

Challenges

  • Understanding AI limitations
  • Validating AI-generated fixes
  • Integrating AI tools into workflows
  • Maintaining security expertise

The Dual-Use Reality

AI that can audit can also exploit. EVMbench results show:

  • 72.2% exploit success is high enough to matter
  • Defensive use must outpace offensive capability
  • Monitoring and safeguards are essential

Practical Implications

For Security Researchers

  • EVMbench provides standardized evaluation
  • Compare AI tools objectively
  • Identify capability gaps
  • Guide R&D priorities

For Protocol Teams

  • AI can supplement human audits
  • Use for pre-audit scanning
  • Automate regression testing
  • Validate patches before deployment

For AI Developers

  • Clear metrics for improvement
  • Open-source tooling available
  • Community benchmark participation
  • Real-world impact assessment

The Path Forward

What’s Needed

Improvement Area      Current State                   Goal State
Exhaustive detection  Misses issues                   Complete codebase coverage
Safe patching         Sometimes breaks functionality  Always preserve behavior
Multi-chain           EVM only                        Cross-chain security
Timing attacks        Out of scope                    Full execution models

OpenAI’s Commitments

  • Release EVMbench tasks and tooling
  • $10M in API credits for cyber defense
  • Aardvark security research agent (private beta)
  • Free codebase scanning for widely used projects

Getting Started with EVMbench

For Researchers

# Clone the evaluation framework
git clone https://github.com/openai/evmbench
cd evmbench

# Install dependencies
cargo install evmbench-harness

# Run evaluation
evmbench evaluate --model gpt-5.3-codex --mode exploit

For Developers

  • Review EVMbench paper (PDF available)
  • Test your contracts against evaluation tasks
  • Submit improvements to the benchmark
  • Apply for API credits via Cyber Security Grant Program

Conclusion

EVMbench represents a significant step forward in measuring AI capabilities for smart contract security. With GPT-5.3-Codex achieving 72.2% exploit success and clear improvement trajectories, the writing is on the wall: AI will play an increasingly important role in keeping DeFi secure.

For developers, the message is clear: embrace AI-assisted security tools, but don’t abandon human expertise. The most robust security posture combines AI’s speed with human judgment.

The benchmark is open-source and ready for community contribution. Whether you’re building AI models, developing smart contracts, or securing DeFi protocols, EVMbench offers a common framework for measuring progress and identifying gaps.

The future of smart contract security isn’t human versus AI—it’s human and AI working together to build more secure systems.

Resources

  • EVMbench Paper: Available on OpenAI’s CDN (PDF)
  • GitHub Repository: github.com/openai/evmbench
  • Paradigm Announcement: paradigm.xyz/2026/02/evmbench
  • Cyber Security Grant Program: Apply via OpenAI
