The winding road to functional light clients - part 2

You should read part 1 first.

It's 2021. Ethereum went live a little more than five years ago. There's currently no reliable, lightweight way to interact with the protocol without using a centralized provider. Various research efforts have demonstrated that the functionality can be built as long as you have access to the necessary data. In this post, we'll cover what the chain history is and what problems need to be solved to make it available in a form suitable for a functional light client.

Chain history

The term "chain history" refers to all of the historical headers, blocks, and receipts. In the current ETH protocol, which all Ethereum clients use to communicate, there are message pairs that nodes can use to request this information from each other:

The full specification of the ETH protocol can be found in the devp2p specifications repository.

The chain history is a relatively simple data set to work with. It can be modeled as an append-only file with each new block adding a header and the corresponding transactions, uncles, and receipts.
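To make that model concrete, here's a rough sketch of the idea in Python (the BlockRecord and ChainHistory names are just illustrative, not types from any actual client):

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder aliases; real clients store RLP-encoded headers, transactions,
# uncles, and receipts.
Header = bytes
Transaction = bytes
Receipt = bytes

@dataclass
class BlockRecord:
    header: Header
    transactions: List[Transaction]
    uncles: List[Header]
    receipts: List[Receipt]

@dataclass
class ChainHistory:
    """The chain history as an append-only log, one record per block."""
    records: List[BlockRecord] = field(default_factory=list)

    def append(self, record: BlockRecord) -> None:
        # New blocks only ever extend the end of the log; old records never change.
        self.records.append(record)
```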

The ETH protocol is optimized so that new clients joining the network can efficiently retrieve the full history of the chain. Once a client is fully synced, however, it has no need for this data beyond serving JSON-RPC requests. It also has no ongoing need to request it, since new blocks and headers arrive via separate gossip messages and receipts are generated locally as part of block execution. Still, the network effectively requires every client to hold the full history.

So, the network contains all of the data we need, but it is not architected for our use case. Trying to build a light client on this protocol exposes three main issues:

  • A: We need a concise view of the canonical chain.
  • B: We need an index to look up blocks by number and transactions by hash.
  • C: We need to reduce the total data stored by individual nodes.

A: Canonical Header Chain

The trustless way to know the tip of the chain is to build the full chain of headers from genesis. That means fetching roughly 11 million headers and dedicating about 6 GB of storage to hold them.
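As a quick back-of-the-envelope check on those numbers (the ~550-byte average header size is an assumption, not a measured figure):

```python
# Rough storage estimate: ~11 million headers at an assumed average of
# ~550 bytes of RLP each comes out to roughly 6 GB.
HEADER_COUNT = 11_000_000
AVG_HEADER_SIZE = 550  # bytes; an approximation, not a measured figure

print(f"{HEADER_COUNT * AVG_HEADER_SIZE / 1e9:.1f} GB")  # ≈ 6 GB
```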

Without the full chain of headers, clients cannot distinguish between a header that is part of the canonical chain and one that was uncled. 6 GB of storage is too expensive for low-resource devices like phones or a Raspberry Pi. Fetching 11 million headers before being able to make your first request also violates the requirement that the client not have to sync.

Luckily, we can solve this one by borrowing a mechanism from the beacon chain. A "double-batched merkle log accumulator" added to the protocol would give us a simple, well-understood way to prove historical header inclusion. Clients would only need to maintain an accurate picture of the most recent headers, while historical headers could be proven part of the canonical chain with a simple merkle proof against the accumulator.
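To illustrate the general shape of the idea, here's a simplified sketch in Python: a single-level merkle accumulator over fixed-size batches of header hashes. The real double-batched design and its parameters differ, so treat the batch size and helper names below as placeholders rather than anything from the actual specification.

```python
import hashlib
from typing import List

def hash_pair(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()

def _pad(leaves: List[bytes]) -> List[bytes]:
    # Pad the leaf list to a power of two so the tree is complete.
    nodes = list(leaves)
    while len(nodes) & (len(nodes) - 1):
        nodes.append(b"\x00" * 32)
    return nodes

def merkle_root(leaves: List[bytes]) -> bytes:
    nodes = _pad(leaves)
    while len(nodes) > 1:
        nodes = [hash_pair(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def merkle_proof(leaves: List[bytes], index: int) -> List[bytes]:
    """Sibling hashes needed to recompute the root for leaves[index]."""
    nodes = _pad(leaves)
    proof = []
    while len(nodes) > 1:
        proof.append(nodes[index ^ 1])
        nodes = [hash_pair(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
        index //= 2
    return proof

def verify(leaf: bytes, index: int, proof: List[bytes], root: bytes) -> bool:
    node = leaf
    for sibling in proof:
        node = hash_pair(node, sibling) if index % 2 == 0 else hash_pair(sibling, node)
        index //= 2
    return node == root

def epoch_roots(header_hashes: List[bytes], epoch_size: int = 8192) -> List[bytes]:
    # Batch canonical header hashes into fixed-size epochs, one root per epoch.
    # A client holding just these roots (a few KB) plus recent headers can
    # check any historical header against a short proof.
    return [merkle_root(header_hashes[i:i + epoch_size])
            for i in range(0, len(header_hashes), epoch_size)]
```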

Getting around the sync problem is merely a UX issue that can be solved at the client level, either by allowing "checkpoint" headers, or simply trusting the observed tip of the chain and performing the actual work to fetch it and verify it asynchronously. It is reasonably straightforward to expose these as feature flags, allowing users to make their own decisions on security vs. convenience trade-offs.

B: Data Indexing

There are a few RPC endpoints that cannot be easily built directly on top of the existing network primitives. Clients currently serve these endpoints by building an index over the chain history. The primary problematic endpoints are:

  • eth_getBlockByNumber
  • eth_getTransactionByHash

For the eth_getBlockByNumber endpoint, the complexity arises from uncles. There is a potentially unbounded number of valid blocks at any given height, but only one of them is part of the canonical chain. For this reason, as clients piece together the canonical chain, they also build their own index that maps block_number -> block_hash. When a request comes in over JSON-RPC for a block by number, this index is used to translate it into a request for the block by hash.

Roughly the same problem exists for eth_getTransactionByHash. Taking uncles into account, a transaction can exist in multiple different blocks, but it may only appear in a single block on the canonical chain. As clients process the canonical chain, they build an index that maps transaction_hash -> (block_hash, transaction_index). When a request comes in for a transaction, this index is used to look up the transaction along with the canonical block in which it was mined. Both the transaction and the block are needed to construct the JSON-RPC response.
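In rough Python terms, the two indices and the way they back these endpoints look something like this (the names are illustrative, not taken from any client):

```python
from typing import Dict, List, Tuple

# block_number -> canonical block hash
block_index: Dict[int, bytes] = {}
# transaction hash -> (canonical block hash, position within that block)
tx_index: Dict[bytes, Tuple[bytes, int]] = {}

def index_block(number: int, block_hash: bytes, tx_hashes: List[bytes]) -> None:
    """Update both indices as each canonical block is processed."""
    block_index[number] = block_hash
    for position, tx_hash in enumerate(tx_hashes):
        tx_index[tx_hash] = (block_hash, position)

def lookup_block_by_number(number: int) -> bytes:
    # eth_getBlockByNumber: translate the number into a hash, then fetch by hash.
    return block_index[number]

def lookup_transaction(tx_hash: bytes) -> Tuple[bytes, int]:
    # eth_getTransactionByHash: find the canonical block and the tx's position in it.
    return tx_index[tx_hash]
```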

So we need a mechanism for the network to expose these indices. The header accumulator gives us a way to make the index data provably part of the canonical chain.

C: Reducing individual storage requirements

Nodes in the DevP2P ETH network are expected to be able to answer any request for any piece of chain history. Requests might come in for very recent blocks, very old blocks, or anything in between. The network has no mechanism that allows a node to store only a subset of this data; it fundamentally depends on the assumption that all nodes have all of it. The network itself has no way to enforce this, but clients will tend to disconnect from a peer that fails to respond with the requested data. This provides some security, since clients that can't answer requests are unlikely to maintain a healthy pool of peer connections.

So the first problem to solve is letting individual nodes store only a subset of the data while giving the network a way for nodes to quickly find the peers that have the data they need.

This turns out to be an emergent property of Kademlia DHT networks. The network topology itself provides O(log(N)) lookups across an arbitrarily large set of keys. Assuming we key the data by things like block hashes and transaction hashes, it can be looked up from the DHT, verified against the corresponding header, and proven part of the canonical chain using the header accumulator.
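Here's a small sketch of that core idea: node IDs and content keys share one key space, and content lives on the nodes whose IDs are closest to its key by XOR distance. The content key derivation below is a stand-in, not the actual Alexandria encoding.

```python
import hashlib
from typing import List

def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia's distance metric: XOR of the two 256-bit keys."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def content_key(block_hash: bytes) -> bytes:
    # Stand-in derivation that maps a content identifier into the same
    # key space as node IDs; the real Alexandria encoding will differ.
    return hashlib.sha256(block_hash).digest()

def closest_nodes(node_ids: List[bytes], key: bytes, k: int = 16) -> List[bytes]:
    """The k nodes most likely to store (or be asked to store) this key."""
    return sorted(node_ids, key=lambda nid: xor_distance(nid, key))[:k]

# A lookup repeatedly asks the closest known nodes for even closer ones,
# homing in on the key's neighborhood in O(log N) steps, so no node needs
# to store more than its own small slice of the data.
```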

The Alexandria DHT

"Alexandria" is the working name for a Kademlia DHT network that aims to provide on demand access to the chain history. The network itself is built on top of the Discovery v5 protocol that acts as a foundation for both the Beacon chain and the mainnet Ethereum network. This means there are implementations available in most, if not all of the languages that clients are written in.

Modification of the core protocol to add the header accumulator isn't strictly required, but it does greatly improve the situation. Even without Alexandria, the core protocol would be better off with this accumulator.

There are also scaling issues that need to be resolved. The canonical header index, which maps block numbers to their canonical hashes, is relatively small, containing only one entry per header. The transaction index, however, is huge: it contains nearly a billion entries, one for every transaction that has ever been sent. For comparison, the widely used BitTorrent DHT contains roughly 26 million torrents; Ethereum mainnet would need to store roughly 40 times as many "things" in our DHT.

Building and specifying the Alexandria DHT is an ongoing research effort. We have a very rough in-progress specification and a proof-of-concept client, and we'll continue to publish updates on this blog.


In the next post in this series, we'll cover the problem of making the Ethereum "state" available and how we are looking to solve it.


On to part 3...