EVMbench: OpenAI and Paradigm Benchmark AI Agents for Smart Contract Security
Introduction
Smart contracts routinely secure over $100 billion in open-source crypto assets. With AI agents becoming increasingly capable of reading, writing, and executing code, measuring their ability to operate in economically meaningful environments has never been more critical. Enter EVMbench, a new benchmark developed by OpenAI in collaboration with Paradigm that evaluates AI agents’ capabilities in detecting, patching, and exploiting high-severity smart contract vulnerabilities.
This article explores what EVMbench is, how it works, what the results mean for developers, and why it represents a significant milestone in the intersection of AI and blockchain security.
Why Smart Contract Security Matters
The financial stakes in smart contract security are enormous. Billions of dollars in assets flow through DeFi protocols, and a single vulnerability can result in catastrophic losses. Traditional security audits, while essential, cannot scale to meet the pace of development across dozens of chains and thousands of protocols.
AI agents offer a potential solution—but how do we measure their effectiveness? That’s the question EVMbench aims to answer.
The Financial Stakes
| Metric | Value |
|---|---|
| Total value secured by smart contracts | $100B+ |
| Average DeFi hack (2024) | $50M+ |
| Code audit backlog | Months of waiting |
What is EVMbench?
EVMbench is an open evaluation framework designed to measure how well AI agents can (see the sketch after this list):
- Detect vulnerabilities in smart contract codebases
- Patch vulnerable contracts while preserving functionality
- Exploit vulnerabilities in sandboxed environments
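As a rough mental model, each benchmark task pairs a contract codebase with one of these modes plus the metadata needed to grade it. The Rust sketch below is a hypothetical illustration; the type and field names (`Mode`, `Task`, `known_vulnerabilities`) are assumptions, not EVMbench's actual schema.

```rust
// Hypothetical data model for an EVMbench task. All names here are
// illustrative assumptions; the benchmark's real schema is not shown
// in this article.

/// The three evaluation modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Detect,  // recall ground-truth vulnerabilities
    Patch,   // remove the bug while preserving behavior
    Exploit, // drain funds in a sandboxed chain
}

/// One task: a vulnerable codebase plus grading metadata.
struct Task {
    id: String,
    mode: Mode,
    repo_path: String,                  // codebase under test
    known_vulnerabilities: Vec<String>, // ground-truth findings (detect mode)
}

fn main() {
    let task = Task {
        id: "code4rena-042".into(),
        mode: Mode::Exploit,
        repo_path: "tasks/code4rena-042".into(),
        known_vulnerabilities: vec!["reentrancy-in-withdraw".into()],
    };
    println!(
        "{} ({:?} mode): {} known issue(s) in {}",
        task.id,
        task.mode,
        task.known_vulnerabilities.len(),
        task.repo_path
    );
}
```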
What Makes EVMbench Different?
Unlike synthetic benchmarks, EVMbench uses real vulnerabilities sourced from:
- Code4rena auditing competitions (the majority of tasks)
- Tempo blockchain audit scenarios
- Historical smart contract exploits
By the Numbers
| Metric | Value |
|---|---|
| Total vulnerabilities | 120 |
| Source audits | 40 |
| Evaluation modes | 3 |
| Chains supported | EVM-compatible |
The Three Evaluation Modes
1. Detect Mode
Agents audit a smart contract repository and are scored on recall of the ground-truth vulnerabilities and of the audit rewards associated with them.
How It Works
- Agent receives complete codebase
- Agent identifies security issues
- Score based on recall of known vulnerabilities (a scoring sketch follows)
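The scoring itself is conceptually simple. Here is a minimal sketch that computes recall by exact finding ID; the IDs are invented, and real grading against free-form agent reports necessarily involves fuzzier matching.

```rust
// Minimal sketch of detect-mode scoring: recall over ground-truth
// findings. Exact-ID matching is a simplifying assumption; real audit
// reports require fuzzier matching.

use std::collections::HashSet;

/// Fraction of ground-truth findings the agent reported.
/// Assumes a non-empty ground-truth set.
fn recall(ground_truth: &[&str], reported: &[&str]) -> f64 {
    let found: HashSet<&str> = reported.iter().copied().collect();
    let hits = ground_truth.iter().filter(|v| found.contains(*v)).count();
    hits as f64 / ground_truth.len() as f64
}

fn main() {
    let truth = ["reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"];
    // The agent stopped after its first finding -- the failure mode the
    // benchmark calls out -- so recall is only 1/3.
    let reported = ["reentrancy-withdraw"];
    println!("recall = {:.2}", recall(&truth, &reported));
}
```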
Key Insight
“Agents sometimes stop after identifying a single issue rather than exhaustively auditing the codebase.”
2. Patch Mode
Agents modify vulnerable contracts to eliminate exploitability while preserving intended functionality.
How It Works
- Agent receives vulnerable contract
- Agent implements fix
- Automated tests verify: (a) functionality preserved, (b) vulnerability eliminated (see the verification sketch below)
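Both conditions must hold at once, which is what makes the mode hard: fixing (b) by deleting functionality fails (a). A hypothetical sketch of that two-sided check follows; `PatchResult` and its fields are stand-ins for the harness's real output (e.g., from a test-suite run plus an exploit replay).

```rust
// Hypothetical patch-mode grading: a patch passes only if the
// functional test suite still succeeds AND the known exploit now fails.
// The struct and its fields are illustrative assumptions.

struct PatchResult {
    tests_pass: bool,       // (a) intended functionality preserved
    exploit_succeeds: bool, // (b) is the vulnerability still reachable?
}

fn grade(result: &PatchResult) -> bool {
    result.tests_pass && !result.exploit_succeeds
}

fn main() {
    // Over-patched: the exploit is gone, but so is a legitimate feature.
    let over_patched = PatchResult { tests_pass: false, exploit_succeeds: false };
    // Correct: behavior preserved, exploit eliminated.
    let correct = PatchResult { tests_pass: true, exploit_succeeds: false };
    assert!(!grade(&over_patched));
    assert!(grade(&correct));
    println!("patch grading sketch OK");
}
```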
Key Insight
“Maintaining full functionality while removing subtle vulnerabilities remains challenging.”
3. Exploit Mode
Agents execute end-to-end fund-draining attacks against deployed contracts in a sandboxed blockchain environment.
How It Works
- Contract deployed to local Anvil instance
- Agent writes and executes exploit code
- Grading performed via transaction replay and on-chain verification (sketched below)
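Conceptually, grading reduces to replaying the agent's transactions in order and then checking whether the target still holds funds. The toy ledger below stands in for a real Anvil node and EVM state; only the sequential replay-then-verify shape reflects the benchmark's description.

```rust
// Toy sketch of exploit-mode grading: replay transactions sequentially
// against a ledger, then check whether the target was drained. A real
// grader replays against an Anvil node and verifies state on-chain.

use std::collections::HashMap;

struct Tx {
    from: &'static str,
    to: &'static str,
    value: u128,
}

fn replay(balances: &mut HashMap<&'static str, u128>, txs: &[Tx]) {
    for tx in txs {
        // Sequential replay, as in the benchmark's grading; no
        // parallel-execution scenarios are modeled.
        let v = tx.value.min(*balances.get(tx.from).unwrap_or(&0));
        *balances.entry(tx.from).or_insert(0) -= v;
        *balances.entry(tx.to).or_insert(0) += v;
    }
}

fn main() {
    let mut balances = HashMap::from([("vault", 100u128), ("attacker", 0u128)]);
    // The "exploit": a transaction sequence that moves the vault's funds.
    let txs = [Tx { from: "vault", to: "attacker", value: 100 }];
    replay(&mut balances, &txs);
    let drained = balances["vault"] == 0;
    println!("funds drained: {drained}"); // explicit, binary success signal
}
```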
Key Insight
“The objective is explicit: continue iterating until funds are drained.”
Performance Results
Current Model Rankings
| Model | Exploit Mode Success Rate | Release Date |
|---|---|---|
| GPT-5.3-Codex | 72.2% | 2026 |
| GPT-5 | 31.9% | Mid-2025 |
Performance Analysis
Exploit Mode: Strongest Performance
AI agents excel when the objective is clear and measurable. Exploit mode provides explicit feedback—either funds are drained or they aren’t—making it easier for agents to iterate toward success.
Detect Mode: Room for Improvement
Agents struggle with exhaustive auditing. Common issues:
- Stopping after finding one vulnerability
- Missing edge cases in complex contracts
- Overlooking low-severity issues that compound
Patch Mode: The Hardest Challenge
Balancing security fixes with functional preservation is difficult. Agents often:
- Introduce breaking changes
- Over-patch, breaking legitimate functionality
- Miss subtle logic vulnerabilities
Technical Implementation
Evaluation Framework
OpenAI developed a Rust-based harness that:
- Deploys contracts deterministically
- Replays agent transactions
- Restricts unsafe RPC methods (sketched after this list)
- Provides reproducible results
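The RPC restriction is worth dwelling on: without it, an agent could "win" exploit mode by calling node cheat methods instead of finding a real bug. Below is a hypothetical sketch of such a filter. The blocked prefixes are real Anvil/Hardhat cheat-method namespaces (e.g., `anvil_setBalance`), but the allow-list logic itself is an assumption about how the harness works.

```rust
// Hypothetical RPC filter placed in front of the sandboxed node:
// reject cheat-code namespaces so an agent cannot fake a drain
// (e.g., by minting itself balance with anvil_setBalance).

const BLOCKED_PREFIXES: &[&str] = &["anvil_", "hardhat_", "evm_"];

fn is_allowed(method: &str) -> bool {
    !BLOCKED_PREFIXES.iter().any(|&p| method.starts_with(p))
}

fn main() {
    assert!(is_allowed("eth_sendRawTransaction")); // normal traffic passes
    assert!(!is_allowed("anvil_setBalance"));      // cheat methods are rejected
    println!("RPC filter sketch OK");
}
```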
Sandbox Environment
```
// Exploit mode runs in an isolated environment
Environment:
  - Local Anvil instance (not mainnet)
  - Historical vulnerabilities only
  - Single-chain support
  - Mock contracts for some scenarios
```
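For local experimentation, a comparable sandbox can be stood up by spawning a clean Anvil node as a child process. The sketch below assumes Foundry's `anvil` binary is on `PATH`; `--port` and `--chain-id` are standard Anvil flags, though how EVMbench itself launches nodes is not documented here.

```rust
// Spawn a clean local Anvil node (no mainnet fork) as the sandbox.
// Assumes Foundry's `anvil` binary is installed and on PATH.

use std::process::{Child, Command};

fn start_sandbox(port: u16) -> std::io::Result<Child> {
    let port_arg = port.to_string();
    Command::new("anvil")
        .args(["--port", port_arg.as_str(), "--chain-id", "31337"])
        .spawn() // node keeps running until the child is killed
}

fn main() -> std::io::Result<()> {
    let mut node = start_sandbox(8545)?;
    // ... deploy the vulnerable contract, hand the RPC URL to the agent,
    // replay and grade its transactions ...
    node.kill()?; // tear the sandbox down after grading
    Ok(())
}
```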
Task Creation Process
- Adapt existing PoC exploit tests
- Manually write scenarios where no PoC exists
- Ensure vulnerabilities are exploitable
- Verify patches don’t break compilation
- Red-team environments to prevent cheating
Limitations and Scope
What’s Included
- 120 curated vulnerabilities from real audits
- Historical, publicly documented issues
- Payment-oriented smart contract scenarios (Tempo)
- Automated grading infrastructure
What’s Excluded
- Mainnet forks (clean Anvil instance only)
- Multi-chain environments
- Timing-dependent behaviors
- Zero-day vulnerabilities
- Human-discovered issues not in competitions
Grading Limitations
- Detect mode: Scores only known vulnerabilities; additional issues an agent finds go unverified
- Exploit mode: Sequential transaction replay; no parallel execution scenarios
- Patch mode: May miss subtle functional regressions
Why This Matters for Developers
AI-Assisted Auditing is Coming
EVMbench signals that AI agents will increasingly participate in smart contract security. For developers, this means:
Opportunities
- Faster initial vulnerability discovery
- Automated regression testing
- Second-pass auditing before human review
- Continuous security monitoring
Challenges
- Understanding AI limitations
- Validating AI-generated fixes
- Integrating AI tools into workflows
- Maintaining security expertise
The Dual-Use Reality
AI that can audit can also exploit. EVMbench results show:
- 72.2% exploit success is high enough to matter
- Defensive use must outpace offensive capability
- Monitoring and safeguards are essential
Practical Implications
For Security Researchers
- EVMbench provides standardized evaluation
- Compare AI tools objectively
- Identify capability gaps
- Guide R&D priorities
For Protocol Teams
- AI can supplement human audits
- Use for pre-audit scanning
- Automate regression testing
- Validate patches before deployment
For AI Developers
- Clear metrics for improvement
- Open-source tooling available
- Community benchmark participation
- Real-world impact assessment
The Path Forward
What’s Needed
| Improvement Area | Current State | Goal State |
|---|---|---|
| Exhaustive detection | Misses issues | Complete codebase coverage |
| Safe patching | Sometimes breaks functionality | Always preserve behavior |
| Multi-chain | EVM only | Cross-chain security |
| Timing attacks | Out of scope | Full execution models |
OpenAI’s Commitments
- Release EVMbench tasks and tooling
- $10M in API credits for cyber defense
- Aardvark security research agent (private beta)
- Free codebase scanning for widely used projects
Getting Started with EVMbench
For Researchers
```bash
# Clone the evaluation framework
git clone https://github.com/openai/evmbench
cd evmbench

# Install dependencies
cargo install evmbench-harness

# Run evaluation
evmbench evaluate --model gpt-5.3-codex --mode exploit
```
For Developers
- Review EVMbench paper (PDF available)
- Test your contracts against evaluation tasks
- Submit improvements to the benchmark
- Apply for API credits via Cyber Security Grant Program
Conclusion
EVMbench represents a significant step forward in measuring AI capabilities for smart contract security. With GPT-5.3-Codex achieving 72.2% exploit success and clear improvement trajectories, the writing is on the wall: AI will play an increasingly important role in keeping DeFi secure.
For developers, the message is clear: embrace AI-assisted security tools, but don’t abandon human expertise. The most robust security posture combines AI’s speed with human judgment.
The benchmark is open-source and ready for community contribution. Whether you’re building AI models, developing smart contracts, or securing DeFi protocols, EVMbench offers a common framework for measuring progress and identifying gaps.
The future of smart contract security isn’t human versus AI—it’s human and AI working together to build more secure systems.
Resources
- EVMbench Paper: Available on OpenAI’s CDN (PDF)
- GitHub Repository: github.com/openai/evmbench
- Paradigm Announcement: paradigm.xyz/2026/02/evmbench
- Cyber Security Grant Program: Apply via OpenAI
