EVMbench: OpenAI and Paradigm Benchmark AI Agents for Smart Contract Security

Introduction

Smart contracts routinely secure over $100 billion in open-source crypto assets. With AI agents becoming increasingly capable of reading, writing, and executing code, measuring their ability to operate in economically meaningful environments has never been more critical. Enter EVMbench, a new benchmark developed by OpenAI in collaboration with Paradigm that evaluates AI agents’ capabilities in detecting, patching, and exploiting high-severity smart contract vulnerabilities.

This article explores what EVMbench is, how it works, what the results mean for developers, and why it represents a significant milestone in the intersection of AI and blockchain security.

Why Smart Contract Security Matters

The financial stakes in smart contract security are enormous. Billions of dollars in assets flow through DeFi protocols, and a single vulnerability can result in catastrophic losses. Traditional security audits, while essential, cannot scale to meet the pace of development across dozens of chains and thousands of protocols.

AI agents offer a potential solution—but how do we measure their effectiveness? That’s the question EVMbench aims to answer.

The Financial Stakes

Metric                                    Value
Total value secured by smart contracts    $100B+
Average DeFi hack (2024)                  $50M+
Code audit backlog                        Months of waiting

What is EVMbench?

EVMbench is an open evaluation framework designed to measure how well AI agents can:

  1. Detect vulnerabilities in smart contract codebases
  2. Patch vulnerable contracts while preserving functionality
  3. Exploit vulnerabilities in sandboxed environments

What Makes EVMbench Different?

Unlike synthetic benchmarks, EVMbench uses real vulnerabilities sourced from:

  • Code4rena auditing competitions (majority)
  • Tempo blockchain audit scenarios
  • Historical smart contract exploits
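A classic pattern behind many of the historical exploits such benchmarks draw on is reentrancy. The following is a minimal Python simulation of that pattern (an illustrative sketch, not Solidity and not an EVMbench task): the vault makes an external call before updating its books, so the callback can withdraw again while the old balance is still recorded.

```python
# Minimal Python simulation of a reentrancy bug: the external call happens
# BEFORE the balance is zeroed, so a malicious callback can re-enter withdraw.

class VulnerableVault:
    def __init__(self):
        self.balances = {}
        self.total = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who, callback):
        amount = self.balances.get(who, 0)
        if amount > 0:
            callback(amount)            # external call before state update
            self.balances[who] = 0      # too late: callback already re-entered
            self.total -= amount

vault = VulnerableVault()
vault.deposit("attacker", 10)
vault.deposit("victim", 90)

drained = []
def reenter(amount):
    drained.append(amount)
    if len(drained) < 3:                # re-enter while balance is still recorded
        vault.withdraw("attacker", reenter)

vault.withdraw("attacker", reenter)
print(sum(drained))  # attacker pulls 30 from a 10-token deposit
```

The fix is the checks-effects-interactions ordering: zero the balance first, then make the external call.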

By the Numbers

Metric                 Value
Total vulnerabilities  120
Source audits          40
Evaluation modes       3
Chains supported       EVM-compatible

The Three Evaluation Modes

1. Detect Mode

Agents audit a smart contract repository and are scored on their ability to recall ground-truth vulnerabilities and associated audit rewards.

How It Works

  • Agent receives complete codebase
  • Agent identifies security issues
  • Score based on recall of known vulnerabilities
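The recall scoring described above can be sketched in a few lines. This is a hedged reconstruction, not the benchmark's actual grader: the vulnerability identifiers and reward values are hypothetical, and the assumption is that each ground-truth finding is weighted by its associated audit reward.

```python
# Sketch of detect-mode scoring: reward-weighted recall over ground-truth
# findings. Vulnerability ids and reward amounts below are hypothetical.

def detect_score(ground_truth: dict, agent_findings: set) -> float:
    """ground_truth maps vulnerability id -> audit reward (weight)."""
    total = sum(ground_truth.values())
    recalled = sum(r for vid, r in ground_truth.items() if vid in agent_findings)
    return recalled / total if total else 0.0

truth = {
    "reentrancy-withdraw": 50_000,
    "oracle-staleness":    20_000,
    "access-control":      30_000,
}
found = {"reentrancy-withdraw", "access-control"}
print(detect_score(truth, found))  # 0.8
```

Under this metric, stopping after one finding caps the score hard, which is exactly the failure mode the benchmark surfaces.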

Key Insight

“Agents sometimes stop after identifying a single issue rather than exhaustively auditing the codebase.”

2. Patch Mode

Agents modify vulnerable contracts to eliminate exploitability while preserving intended functionality.

How It Works

  • Agent receives vulnerable contract
  • Agent implements fix
  • Automated tests verify: (a) functionality preserved, (b) vulnerability eliminated
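The two-sided verification above can be sketched as a single gate: a patch passes only if the functional test suite still passes and the known exploit no longer succeeds. The callables here stand in for the benchmark's real test harness; the names are hypothetical.

```python
# Sketch of the patch-mode gate: functionality preserved AND exploit blocked.
# `functional_tests` and `exploit` are stand-ins for the real harness.

def grade_patch(functional_tests, exploit) -> bool:
    functionality_preserved = all(t() for t in functional_tests)
    vulnerability_eliminated = not exploit()
    return functionality_preserved and vulnerability_eliminated

# Toy stand-ins: a patch that keeps behavior and blocks the drain attempt.
tests = [lambda: True, lambda: True]        # e.g. deposits/withdrawals still work
exploit_attempt = lambda: False             # exploit no longer succeeds
print(grade_patch(tests, exploit_attempt))  # True

# An over-patched contract fails the gate even though the exploit is blocked.
broken_tests = [lambda: True, lambda: False]
print(grade_patch(broken_tests, exploit_attempt))  # False
```

The second case is the "over-patching" failure mode the results section highlights: removing the vulnerability is not enough if legitimate behavior breaks.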

Key Insight

“Maintaining full functionality while removing subtle vulnerabilities remains challenging.”

3. Exploit Mode

Agents execute end-to-end fund-draining attacks against deployed contracts in a sandboxed blockchain environment.

How It Works

  • Contract deployed to local Anvil instance
  • Agent writes and executes exploit code
  • Grading performed via transaction replay and on-chain verification
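The replay-and-verify grading above can be sketched as follows. This is a simulation with in-memory bookkeeping, whereas the real harness replays transactions against an Anvil node and checks on-chain state; the TVL figures are hypothetical.

```python
# Sketch of exploit-mode grading: replay the agent's transactions in order
# against a fresh deployment, then check whether the contract was drained.
# State is simulated here; the real harness queries the chain.

def replay_and_grade(initial_tvl: int, transactions, state: dict) -> bool:
    state["tvl"] = initial_tvl
    for tx in transactions:        # sequential replay, as in the benchmark
        tx(state)
    return state["tvl"] == 0       # success iff funds are fully drained

txs = [
    lambda s: s.update(tvl=s["tvl"] - 600_000),   # exploit transaction 1
    lambda s: s.update(tvl=s["tvl"] - 400_000),   # exploit transaction 2
]
print(replay_and_grade(1_000_000, txs, {}))  # True: contract fully drained
```

The binary outcome is what makes this mode such a clean training signal: the grader does not judge the exploit's elegance, only the final balance.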

Key Insight

“The objective is explicit: continue iterating until funds are drained.”

Performance Results

Current Model Rankings

Model          Exploit Mode  Release Date
GPT-5.3-Codex  72.2%         2026
GPT-5          31.9%         Mid-2025

Performance Analysis

Exploit Mode: Strongest Performance

AI agents excel when the objective is clear and measurable. Exploit mode provides explicit feedback—either funds are drained or they aren’t—making it easier for agents to iterate toward success.

Detect Mode: Room for Improvement

Agents struggle with exhaustive auditing. Common issues:

  • Stopping after finding one vulnerability
  • Missing edge cases in complex contracts
  • Overlooking low-severity issues that compound

Patch Mode: The Hardest Challenge

Balancing security fixes with functional preservation is difficult. Agents often:

  • Introduce breaking changes
  • Over-patch, breaking legitimate functionality
  • Miss subtle logic vulnerabilities

Technical Implementation

Evaluation Framework

OpenAI developed a Rust-based harness that:

  • Deploys contracts deterministically
  • Replays agent transactions
  • Restricts unsafe RPC methods
  • Provides reproducible results
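One of the safeguards listed above, restricting unsafe RPC methods, can be sketched as a simple prefix filter. The idea is to block node cheat codes (the `anvil_*`, `evm_*`, and `hardhat_*` namespaces used by local development nodes to mint balances or warp time) so agents cannot "win" by editing chain state directly. The exact blocklist is an assumption, not the harness's actual policy.

```python
# Sketch of an RPC-method filter: block node cheat-code namespaces so an
# agent must drain funds through real transactions, not by editing state.
# The prefix list is an assumption about the harness's policy.

BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_")

def rpc_allowed(method: str) -> bool:
    """Return True if the RPC call may be forwarded to the node."""
    return not method.startswith(BLOCKED_PREFIXES)

print(rpc_allowed("eth_sendRawTransaction"))  # True: normal transaction
print(rpc_allowed("anvil_setBalance"))        # False: balance cheat code
print(rpc_allowed("evm_increaseTime"))        # False: time-warp cheat code
```

An allowlist of known-safe `eth_*` methods would be stricter than this blocklist; either way, the point is that the sandbox only counts exploits achievable on a real chain.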

Sandbox Environment

// Exploit mode runs in an isolated environment
Environment:
  - Local Anvil instance (not mainnet)
  - Historical vulnerabilities only
  - Single-chain support
  - Mock contracts for some scenarios

Task Creation Process

  1. Adapt existing PoC exploit tests
  2. Manually write scenarios where no PoC exists
  3. Ensure vulnerabilities are exploitable
  4. Verify patches don’t break compilation
  5. Red-team environments to prevent cheating

Limitations and Scope

What’s Included

  • 120 curated vulnerabilities from real audits
  • Historical, publicly documented issues
  • Payment-oriented smart contract scenarios (Tempo)
  • Automated grading infrastructure

What’s Excluded

  • Mainnet forks (clean Anvil instance only)
  • Multi-chain environments
  • Timing-dependent behaviors
  • Zero-day vulnerabilities
  • Human-discovered issues not in competitions

Grading Limitations

  • Detect mode: Only known vulnerabilities are scored; additional issues found by agents go unverified
  • Exploit mode: Sequential transaction replay; no parallel execution scenarios
  • Patch mode: May miss subtle functional regressions

Why This Matters for Developers

AI-Assisted Auditing is Coming

EVMbench signals that AI agents will increasingly participate in smart contract security. For developers, this means:

Opportunities

  • Faster initial vulnerability discovery
  • Automated regression testing
  • Second-pass auditing before human review
  • Continuous security monitoring

Challenges

  • Understanding AI limitations
  • Validating AI-generated fixes
  • Integrating AI tools into workflows
  • Maintaining security expertise

The Dual-Use Reality

AI that can audit can also exploit. EVMbench results show:

  • 72.2% exploit success is high enough to matter
  • Defensive use must outpace offensive capability
  • Monitoring and safeguards are essential

Practical Implications

For Security Researchers

  • EVMbench provides standardized evaluation
  • Compare AI tools objectively
  • Identify capability gaps
  • Guide R&D priorities

For Protocol Teams

  • AI can supplement human audits
  • Use for pre-audit scanning
  • Automate regression testing
  • Validate patches before deployment

For AI Developers

  • Clear metrics for improvement
  • Open-source tooling available
  • Community benchmark participation
  • Real-world impact assessment

The Path Forward

What’s Needed

Improvement Area      Current State                   Goal State
Exhaustive detection  Misses issues                   Complete codebase coverage
Safe patching         Sometimes breaks functionality  Always preserve behavior
Multi-chain           EVM only                        Cross-chain security
Timing attacks        Out of scope                    Full execution models

OpenAI’s Commitments

  • Release EVMbench tasks and tooling
  • $10M in API credits for cyber defense
  • Aardvark security research agent (private beta)
  • Free codebase scanning for widely used projects

Getting Started with EVMbench

For Researchers

# Clone the evaluation framework
git clone https://github.com/openai/evmbench
cd evmbench

# Install dependencies
cargo install evmbench-harness

# Run evaluation
evmbench evaluate --model gpt-5.3-codex --mode exploit

For Developers

  • Review EVMbench paper (PDF available)
  • Test your contracts against evaluation tasks
  • Submit improvements to the benchmark
  • Apply for API credits via Cyber Security Grant Program

Conclusion

EVMbench represents a significant step forward in measuring AI capabilities for smart contract security. With GPT-5.3-Codex achieving 72.2% exploit success and clear improvement trajectories, the writing is on the wall: AI will play an increasingly important role in keeping DeFi secure.

For developers, the message is clear: embrace AI-assisted security tools, but don’t abandon human expertise. The most robust security posture combines AI’s speed with human judgment.

The benchmark is open-source and ready for community contribution. Whether you’re building AI models, developing smart contracts, or securing DeFi protocols, EVMbench offers a common framework for measuring progress and identifying gaps.

The future of smart contract security isn’t human versus AI—it’s human and AI working together to build more secure systems.

Resources

  • EVMbench Paper: Available on OpenAI’s CDN (PDF)
  • GitHub Repository: github.com/openai/evmbench
  • Paradigm Announcement: paradigm.xyz/2026/02/evmbench
  • Cyber Security Grant Program: Apply via OpenAI
