OpenAI and Paradigm Launch EVMbench for Ethereum Security

  • OpenAI and Paradigm built EVMbench from 120 real audit vulnerabilities.

  • Benchmark tests AI in detect, patch, and exploit modes using sandboxed EVM environments.

  • GPT-5.3-Codex scored 72.2% in exploit mode, well above the 31.9% reported for GPT-5.

OpenAI, working with Paradigm, unveiled a new benchmark to test AI performance on Ethereum smart contract security. The release, announced this week, introduced EVMbench as a way to measure how AI agents detect, patch, and exploit contract flaws. The effort targets rising risks, as smart contracts secure over $100 billion in crypto assets across EVM networks.

Benchmark Built From Real-World Audit Failures

According to OpenAI, EVMbench draws from 120 high-severity vulnerabilities identified across 40 professional smart contract audits. Notably, many of these issues originated from open audit competitions, including Code4rena. The benchmark focuses on real bugs rather than synthetic examples.

In addition, OpenAI said the dataset includes scenarios linked to security work on the Tempo chain. Tempo operates as a payment-focused Layer-1 network built for stablecoin transfers. Because of that, these cases introduce payment logic risks into the benchmark environment.

To support realistic testing, engineers reused exploit proof-of-concept scripts where available. However, they manually built missing components when documentation proved incomplete. OpenAI said it preserved exploitability while ensuring patches could compile correctly.

Three Testing Modes Stress AI Agents

EVMbench evaluates agents in detect, patch, and exploit modes. In detect mode, agents scan repositories and receive scores based on confirmed vulnerability recall. In patch mode, agents must fix flaws while preserving original contract behavior.
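As a rough illustration of recall-based scoring in detect mode, the sketch below computes the share of known vulnerabilities that an agent reported. The function name and the finding IDs are hypothetical, and the matching rule (exact ID equality) is an assumption; the article does not describe how EVMbench matches agent reports to confirmed findings.

```python
def detect_recall(reported, ground_truth):
    """Return the fraction of known vulnerabilities the agent reported.

    Both arguments are iterables of finding IDs. Matching by exact ID is an
    assumption made for this sketch, not EVMbench's documented method.
    """
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must not be empty")
    found = truth & set(reported)  # reports outside the ground truth are ignored
    return len(found) / len(truth)


# Hypothetical audit with four confirmed findings; the agent reports two of
# them plus one unrelated issue.
score = detect_recall(["H-01", "H-03", "X-99"], ["H-01", "H-02", "H-03", "H-04"])
print(score)
```

Recall alone does not penalize extra reports, which is why the unrelated "X-99" entry has no effect on the score here.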

Exploit mode simulates full fund-draining attacks within a sandboxed blockchain. OpenAI said graders confirm outcomes through transaction replay and on-chain state checks. To ensure consistency, the company built a Rust-based harness for deterministic deployments.
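One way to picture an on-chain state check is a comparison of balances before and after the agent's transactions are replayed. The sketch below is a simplified stand-in written in Python rather than Rust: the function name, the account labels, and the success rule (victim loses at least a threshold, attacker ends up with more than it started) are all assumptions, not the logic of OpenAI's harness.

```python
def exploit_succeeded(before, after, victim, attacker, min_drained):
    """Decide whether a replayed exploit drained funds from the victim.

    `before` and `after` map account labels to balances in wei. The success
    rule is an assumption for this sketch, not EVMbench's grading logic.
    """
    drained = before[victim] - after[victim]
    gained = after[attacker] - before[attacker]
    return drained >= min_drained and gained > 0


# Hypothetical snapshot: the vault loses 90 ETH and the attacker nets slightly
# less than that after paying gas.
ETH = 10**18
before = {"vault": 100 * ETH, "attacker": 1 * ETH}
after = {"vault": 10 * ETH, "attacker": 90 * ETH + ETH // 2}
print(exploit_succeeded(before, after, "vault", "attacker", min_drained=50 * ETH))
```

Checking final state rather than the agent's own claims means a grader of this kind only credits attacks that actually moved funds in the sandbox.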

The exploit tests run in a local Anvil environment, not live networks. OpenAI noted that all vulnerabilities are historical and publicly disclosed. Additionally, the harness restricts unsafe RPC calls to reduce misuse.
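Restricting unsafe RPC calls can be sketched as a filter that sits between the agent and the local node and rejects node-control methods. The blocked prefixes below are an assumption chosen for illustration (Anvil exposes cheat methods such as `anvil_setBalance` that would let an agent fabricate a successful exploit); the article does not list which calls the real harness blocks, and `filter_request` is a hypothetical helper.

```python
# Method-name prefixes treated as unsafe in this sketch (an assumption, not
# the harness's actual blocklist).
BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_", "debug_")


def is_allowed(method):
    """Return True when the JSON-RPC method name is not a node-control call."""
    return not method.startswith(BLOCKED_PREFIXES)


def filter_request(request):
    """Pass a JSON-RPC request through, or build a JSON-RPC error reply."""
    method = request.get("method", "")
    if is_allowed(method):
        return {"allowed": True, "request": request}
    return {
        "allowed": False,
        "response": {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            # -32601 is the standard JSON-RPC "method not found" code.
            "error": {"code": -32601, "message": f"method {method} is not permitted"},
        },
    }


print(filter_request({"jsonrpc": "2.0", "id": 1, "method": "eth_sendRawTransaction", "params": []})["allowed"])
print(filter_request({"jsonrpc": "2.0", "id": 2, "method": "anvil_setBalance", "params": []})["allowed"])
```

A prefix blocklist is the simplest design; an allowlist of specific `eth_` methods would be stricter and is arguably the safer default.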

Results and Team Expansion

In reported results, GPT-5.3-Codex achieved a 72.2% score in exploit mode. By comparison, GPT-5, which launched months earlier, reached 31.9%. However, OpenAI said detection and patch coverage remain incomplete.

Alongside EVMbench, OpenAI confirmed a key hire. Peter Steinberger, founder of OpenClaw, joined the company to work on agent development. Sam Altman confirmed the move on X, noting Steinberger will lead next-generation personal agent projects.

