EIP2926 - Chunk-Based Code Merkleization

# Abstract

Code merkleization, along with binarification of the trie and gas cost bump of state accessing opcodes, are considered as the main levers for decreasing block witness sizes in stateless or partial-stateless Eth1x roadmaps. Here we specify a fixed-sized chunk approach to code merkleization and outline how the transition of existing contracts to this model would look like.

# Motivation

Bytecode is currently the second contributor (opens new window) to block witness size, after the proof hashes. Transitioning the trie from hexary to binary reduces the hash section of the witness by 3x, thereby making code the first contributor. By breaking contract code into chunks and committing to those chunks in a merkle tree, stateless clients would only need the chunks that were touched during a given transaction to execute it.

# Specification

This specification assumes that EIP-2584 is deployed, and the merkleization rules and gas costs are proposed accordingly. What follows is structured to have two sections:

How a given contract code is split into chunks and then merkleized
How to merkleize all existing contract codes during a hardfork

# Constants and Definitions

# Constants

CHUNK_SIZE: 32 (bytes)
KEY_LENGTH: 2 (bytes)
MAX_CHUNK_COUNT: 0xfffc
VERSION_KEY: 0xfffd
CODELEN_KEY: 0xfffe
CODEHASH_KEY: 0xffff
VERSION: 0
EMPTY_CODE_ROOT: 0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470 (==keccak256(''))
HF_BLOCK_NUMBER: to be defined

# Definitions

BE(x, N): casts x to an unsigned integer of N bytes and returns its big-endian representation

# Code merkleization

For an account record A with code C, the field A.codeHash is replaced with codeRoot. codeRoot is EMPTY_CODE_ROOT if C is empty. Otherwise it contains the root of codeTrie, a binary trie (opens new window) with the following leaves:

Key: VERSION_KEY, value: BE(VERSION, 1)
Key: CODELEN_KEY, value: BE(length(C), 4)
Key: CODEHASH_KEY, value: keccak256(C)

In addition to the above, codeTrie commits to a list of code chunks = [(FIO_0, code_0), ..., (FIO_n, code_n)] which are derived from C in a way that:

n < MAX_CHUNK_COUNT.
code_0 || ... || code_n == C.
length(code_i) == CHUNK_SIZE where 0 <= i < n.
length(code_n) <= CHUNK_SIZE.
FIO_i is the offset of the first instruction within the chunk. It should only be greater than zero if the last instruction in code_i-1 is a multi-byte instruction (like PUSHN) crossing the chunk boundary. It is set to CHUNK_SIZE in the case where all bytes of a chunk are data.

The ith element of chunks is stored in codeTrie with:

Key: BE(i, KEY_LENGTH)
Value: BE(FIO_i, 1) || code_i, where || stands for byte-wise concatenation

# Contract creation gas cost

As of now there is a charge of 200 gas per byte of the code stored in state by contract creation operations, be it initiated via CREATE, CREATE2, or an external transaction. This per byte cost is to be increased from 200 to TBD to account for the chunking and merkleization costs.

# Updating existing code (transition process)

The transition process involves reading all contracts in the state and applying the above procedure to them. A benchmark showing how long this process will take is still pending, but intuitively it should take longer than the time between two blocks (in the order of hours). Hence we recommend clients to pre-process the changes before the EIP is activated.

Code has the nice property that it is (mostly) static. Therefore clients can maintain a mapping of [accountAddress -> codeRoot] which stores the results for the contracts they have already merkleized. During this pre-computation phase, whenever a new contract is created its codeRoot is computed, and added to the mapping. Whenever a contract self-destructs, its corresponding entry is removed.

At block HF_BLOCK_NUMBER when the EIP gets activated, before executing any transaction the client must update the account record for all contracts with non-empty code in the state to replace the codeHash field with the pre-computed codeRoot. EoA accounts will keep their codeHash value as codeRoot. Accounts with empty code will keep their codeHash value as codeRoot.

# Rationale

# Hexary vs binary trie

The Ethereum mainnet state is encoded as of now in a hexary Merkle Patricia Tree. As part of the Eth1x roadmap, a transition to a binary trie (opens new window) has been investigated (opens new window) with the goal of reducing witness sizes. Because code chunks are also stored in the trie, this EIP would benefit from the witness size reduction offered by the binarification. Therefore we have decided to explicitly state EIP-2584 to be a requirement of this change. Note that if code merkleization is included in a hard-fork beforehand, then all code must be re-merkleized after the binary transition.

# Chunk size

The current recommended chunk size of 32 bytes has been selected based on a few observations. Smaller chunks are more efficient (i.e. have higher chunk utilization (opens new window)), but incur a larger hash overhead (i.e. number of hashes as part of the proof) due to a higher trie depth. Larger chunks are less efficient, but incur less hash overhead. We plan to run a larger experiment comparing various chunk sizes to arrive at a final recommendation.

# First instruction offset

The firstInstructionOffset fields allows safe jumpdest analysis when a client doesn't have all the chunks, e.g. a stateless clients receiving block witnesses.

Note: there could be an edge case when computing FIO for the chunks where the data bytes at the end of a bytecode (last chunk) resemble a multi-byte instruction. This case can be safely ignored.

# Gas cost of code-accessing opcodes

How merkleized code is stored in the client database affects the performance of code-accessing opcodes, i.e: CALL, CALLCODE, DELEGATECALL, STATICCALL, EXTCODESIZE, EXTCODEHASH, and EXTCODECOPY. Storing the code trie with all intermediate nodes in the database implies multiple look-ups to fetch the code of the callee, which is more than the current one look-up (excluding the trie traversal to get the account) required. Note CODECOPY and CODESIZE are not affected since the code for the current contract has already been loaded to memory.

The gas cost analysis in this section assumes a specific way of storing it. In this approach clients only merkleize code once during creation to compute codeRoot, but then discard the chunks. They instead store the full bytecode as well as the metadata fields in the database. We believe per-chunk metering for calls would be more easily solvable by witness metering in the stateless model.

# Different chunking logic

We have considered an alternative option to package chunks, where each chunk is prepended with its chunkLength and would only contain complete opcodes (i.e. any multi-byte opcode not fitting the CHUNK_SIZE would be deferred to the next chunk).

This approach has downsides compared to the one specified:

Requires a larger CHUNK_SIZE -- at least 33 bytes to accommodate the PUSH32 instruction.
It is more wasteful. For example, DUP1 PUSH32 <32-byte payload> would be encoded as two chunks, the first chunk contains only DUP1, and the second contains only the PUSH32 instruction with its payload.
Calculating the number of chunks is not trivial and would have to be stored explicitly in the metadata.

Additionally we have reviewed many other options (basic block based, Solidity subroutines (requires determining the control flow), EIP-2315 subroutines). This EIP however only focuses on the chunk-based option.

# RLP and SSZ

To remain consistent with the binary transition proposal we avoid using RLP for serializing the leaf values. We have furthermore considered SSZ for both serializing data and merkleization and remain open to adopting it, but decided to use the binary trie format for simplicity.

# Metadata fields

The metadata fields version, codeLen and codeHash are added mostly to facilitate a cheaper implementation of EXTCODESIZE and EXTCODEHASH in a stateless paradigm. The version field allows for differentiating between bytecode types (e.g. EVM1.5/EIP-615, EIP-2315, etc.) or code merkleization schemes (or merkleization settings, such as larger CHUNK_SIZE) in future.

Instead of encoding codeHash and codeSize in the metadata, they could be made part of the account. In our opinion, the metadata is a more concise option, because EoAs do not need these fields, resulting in either additional logic (to omit those fields in the accounts) or calculation (to include them in merkleizing the account).

An alternative option to the version field would be to add an account-level field: either following EIP-1702, or by adding an accountKind field (with potential options: eoa, merkleized_evm_chunk32, merkleized_eip2315_chunk64, etc.) as the first member of the account. One benefit this could provide is omitting codeHash for EoAs.

# The keys in the code trie (and `KEY_LENGTH`)

As explained in the specification above, the keys in the code trie are indices of the chunks array. Having a key length of 2 bytes means the trie can address 65536 - 3 (minus metadata fields) chunks, which corresponds to ~2Mb code size. That allows for roughly ~85x increase in the code size limit in future without requiring a change in merkleization.

# Alternate values of codeRoot for EoAs

This proposal changes the meaning of the fourth field (codeHash) of the account. Prior to this change, that field represents the Keccak-256 hash of the bytecode, which is logically hash of an empty input for EoAs.

Since codeHash is replaced with codeRoot, the root hash of the code trie, the value would be different for EoAs under the new rules: the root of the codeTrie(metadata=[codeHash=keccak256(''), codeSize=0]). An alternative would be simply using the hash of an empty trie. Or to avoid introducing yet another constant (the result of the above), one could also consider using codeRoot = 0 for EoAs.

However, we wanted to maintain compatibility with EIP-1052 and decided to keep matters simple by using the hash of empty input (c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470) for EoAs.

# Backwards Compatibility

From the perspective of contracts, the design aims to be transparent, with the exception of changes in gas costs.

Outside of the interface presented for contracts, this proposal introduces major changes to the way contract code is stored, and needs a substantial modification of the Ethereum state. Therefore it can only be introduced via a hard fork.

# Test Cases

TBD

Show the codeRoot for the following cases:

code=''
code='PUSH1(0) DUP1 REVERT'
code='PUSH32(-1)' (data passing through a chunk boundary)

# Implementation

The implementation of the chunking and merkleization logic in Typescript can be found here (opens new window), and in Python here (opens new window). Please note neither of these examples currently use a binary tree.

# Security Considerations

TBA