State Sync

Find out about the Cosmos SDK's support for Tendermint Core state sync.

note

Note: Only interested in syncing a node? Skip ahead to the State Syncing a Node section.

Tendermint Core State Sync

Instead of fetching and replaying all historical blocks, state sync allows a new node to join a network by fetching a snapshot of the network state at a recent height. This can reduce the time needed to sync with the network from days to minutes, because the application state is usually much smaller than the combined size of all blocks and restoring state is faster than replaying blocks.

This section gives a brief overview of the Tendermint state sync protocol and of how a node is synced. For more details, consult the ABCI Application Guide and the ABCI Reference Documentation.

State Sync Snapshots

A guiding design principle for Tendermint state sync was to give applications as much flexibility as possible. Tendermint does not care what snapshots contain, how they are taken, or how they are restored. It is only concerned with discovering existing snapshots in the network, fetching them, and passing them to applications via ABCI. Tendermint uses light client verification to check the final app hash of a restored application against the chain app hash; any further verification must be done by the application itself during restoration.

Snapshots consist of binary chunks in an arbitrary format. The only restriction is that a chunk may not exceed 16 MB. The snapshot metadata exchanged via ABCI and P2P contains the following fields:

  • height (uint64): height at which the snapshot was taken
  • format (uint32): arbitrary application-specific format identifier (e.g. version)
  • chunks (uint32): number of binary chunks in the snapshot
  • hash (bytes): arbitrary snapshot hash for comparing snapshots across nodes
  • metadata (bytes): arbitrary binary snapshot metadata for use by applications

The format field in a snapshot allows applications to modify the format of their snapshots in a backwards-compatible way. This is done by providing snapshots in multiple formats and specifying which formats to accept during restoration. This feature is useful when changing serialization or compression formats, as it allows nodes to provide snapshots to peers running older versions or make use of old snapshots when starting up with a newer version.

The hash field contains a snapshot hash that is not verified by Tendermint. It is left to the application to verify the hash if desired. Even so, snapshots whose metadata fields (including the hash) are identical are treated as the same snapshot, and their chunks can be fetched from any node that offers them. Comparing hashes across nodes also helps detect inadvertent nondeterminism in snapshot generation.
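The identity rule described above can be sketched in Go as follows. The `Snapshot` struct here is a local illustration that mirrors the listed fields; it is not the ABCI type itself, and the `Same` method name is an assumption made for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// Snapshot mirrors the metadata fields listed above. It is a local
// illustration type, not the ABCI type itself.
type Snapshot struct {
	Height   uint64
	Format   uint32
	Chunks   uint32
	Hash     []byte
	Metadata []byte
}

// Same reports whether two snapshots have identical metadata fields, in
// which case their chunks are interchangeable across peers.
func (s Snapshot) Same(o Snapshot) bool {
	return s.Height == o.Height &&
		s.Format == o.Format &&
		s.Chunks == o.Chunks &&
		bytes.Equal(s.Hash, o.Hash) &&
		bytes.Equal(s.Metadata, o.Metadata)
}

func main() {
	a := Snapshot{Height: 3000, Format: 1, Chunks: 3, Hash: []byte{0x0f, 0x14}}
	b := Snapshot{Height: 3000, Format: 1, Chunks: 3, Hash: []byte{0x0f, 0x14}}
	c := Snapshot{Height: 3000, Format: 1, Chunks: 3, Hash: []byte{0xc6, 0x20}}
	fmt.Println(a.Same(b), a.Same(c)) // true false
}
```

Two peers advertising `a` and `b` would be treated as serving the same snapshot, while `c` would be tracked as a different one even though its height and format match.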

The metadata field can hold any arbitrary metadata the application needs. For example, the application may want to include chunk checksums to discard corrupt chunks, or Merkle proofs to verify each chunk individually against the chain app hash. In Protobuf-encoded form, snapshot metadata messages are limited to 4 MB.
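As one way to picture the checksum idea, the sketch below packs a SHA-256 digest per chunk into a byte slice that could be carried in the metadata field. The layout (digests concatenated in chunk order) and the function names are assumptions for illustration, not a format defined by Tendermint or the Cosmos SDK.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// buildChecksums concatenates the SHA-256 digest of every chunk, giving a
// byte slice that could be carried in the snapshot metadata field.
func buildChecksums(chunks [][]byte) []byte {
	out := make([]byte, 0, len(chunks)*sha256.Size)
	for _, c := range chunks {
		sum := sha256.Sum256(c)
		out = append(out, sum[:]...)
	}
	return out
}

// verifyChunk checks the chunk at position index against the checksum list.
func verifyChunk(checksums []byte, index int, chunk []byte) bool {
	start := index * sha256.Size
	if index < 0 || start+sha256.Size > len(checksums) {
		return false
	}
	sum := sha256.Sum256(chunk)
	return bytes.Equal(sum[:], checksums[start:start+sha256.Size])
}

func main() {
	chunks := [][]byte{[]byte("chunk-0"), []byte("chunk-1")}
	meta := buildChecksums(chunks)
	fmt.Println(verifyChunk(meta, 1, chunks[1]))         // true: intact chunk
	fmt.Println(verifyChunk(meta, 1, []byte("corrupt"))) // false: corrupted chunk
}
```

Note that checksums taken from untrusted metadata only catch accidental corruption; they do not by themselves prove that a chunk matches the chain app hash.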

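Taking and Serving Snapshots

For state sync to work, some nodes in the network must take and serve snapshots. When a peer attempts to state sync, an existing Tendermint node calls the following ABCI methods on the application to provide snapshot data to that peer:

  • ListSnapshots: returns the snapshots the application currently has available
  • LoadSnapshotChunk: returns the contents of one chunk of a given snapshot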

Snapshot generation can be slow, so snapshots should generally be produced on a regular schedule rather than on demand. This improves state sync performance and closes a denial-of-service vector in which an adversary repeatedly sends requests to a node. Older snapshots can usually be removed, but it is a good idea to keep at least the two most recent ones so that the previous snapshot is not deleted while a node is still restoring it.

How snapshots are taken is largely up to the application; however, it should strive to provide the following guarantees:
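The retention rule above (keep the most recent snapshots, delete the rest) could look roughly like this. The function name `pruneHeights` and the idea of tracking snapshots by height alone are simplifications for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// pruneHeights returns the snapshot heights to delete so that only the
// keepRecent highest heights remain. The input slice is not modified.
func pruneHeights(heights []uint64, keepRecent int) []uint64 {
	sorted := append([]uint64(nil), heights...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	if keepRecent < 0 {
		keepRecent = 0
	}
	if len(sorted) <= keepRecent {
		return nil
	}
	return sorted[:len(sorted)-keepRecent]
}

func main() {
	// With four snapshots and keepRecent = 2, the two oldest are removed.
	fmt.Println(pruneHeights([]uint64{3000, 1000, 2000, 4000}, 2)) // [1000 2000]
}
```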

  • Asynchronous: Snapshotting should not interfere with the processing of blocks, and thus should be performed asynchronously, for instance, in a separate thread.
  • Consistent: Snapshots should be taken at isolated heights, and their integrity should not be compromised by concurrent writes, such as those resulting from block processing in the main thread.
  • Deterministic: The chunks and metadata components of a snapshot should be identical (at the byte level) across all nodes for a given height and format, to ensure optimal availability of chunks.

To implement a snapshot system, one possible approach is to follow these steps:

  1. Utilize a transactional data store that supports snapshot isolation, such as RocksDB or BadgerDB, to ensure consistency and integrity of the data.
  2. Following the commitment of a block, initiate a read-only database transaction in the main thread.
  3. Pass the transaction handle to a newly created thread to ensure that the snapshot process can operate independently of the main thread.
  4. Iterate through all data items in a deterministic order, such as sorted by key, to ensure that the snapshot is comprehensive and consistent.
  5. Serialize the data items using a serialization format such as Protobuf to ensure efficient and reliable data transmission.
  6. Hash the resulting byte stream and split it into fixed-size chunks (e.g., 10 MB) to facilitate efficient storage and retrieval.
  7. Store the chunks in the file system as separate files to ensure that the snapshot data is persistently stored.
  8. Record the snapshot metadata, including the byte stream hash, in a database or file to enable efficient retrieval and verification of the snapshot.
  9. Close the database transaction and exit the thread to ensure that the snapshot process is completed successfully.
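Steps 4 to 6 above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions: an in-memory map stands in for the read-only database transaction, a simple length-prefixed encoding stands in for Protobuf, and the tiny chunk size is chosen only so the example produces more than one chunk.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// serialize writes all key/value pairs in sorted key order, each field
// prefixed by its length, so every node produces the same byte stream.
func serialize(state map[string][]byte) []byte {
	keys := make([]string, 0, len(state))
	for k := range state {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var buf bytes.Buffer
	var lenBuf [4]byte
	for _, k := range keys {
		binary.BigEndian.PutUint32(lenBuf[:], uint32(len(k)))
		buf.Write(lenBuf[:])
		buf.WriteString(k)
		binary.BigEndian.PutUint32(lenBuf[:], uint32(len(state[k])))
		buf.Write(lenBuf[:])
		buf.Write(state[k])
	}
	return buf.Bytes()
}

// split cuts the stream into chunks of at most chunkSize bytes.
func split(stream []byte, chunkSize int) [][]byte {
	if chunkSize <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(stream) > 0 {
		n := chunkSize
		if n > len(stream) {
			n = len(stream)
		}
		chunks = append(chunks, stream[:n])
		stream = stream[n:]
	}
	return chunks
}

func main() {
	state := map[string][]byte{"b": []byte("2"), "a": []byte("1")}
	stream := serialize(state)
	hash := sha256.Sum256(stream) // hash of the whole stream, for the metadata
	chunks := split(stream, 8)
	fmt.Printf("hash=%x chunks=%d\n", hash[:4], len(chunks))
}
```

Sorting the keys before serializing is what makes the output deterministic: Go map iteration order is random, so skipping that step would yield different chunks on different nodes.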

Additional steps that applications may want to consider include compressing the data to reduce storage requirements, checksumming chunks to ensure data integrity, generating proofs for incremental verification, and removing old snapshots to maintain a tidy and up-to-date snapshot archive.

Restoring Snapshots

When it starts, Tendermint checks whether the local node has any state (that is, whether LastBlockHeight is 0). If no state is present, it begins discovering snapshots through the P2P network. Discovered snapshots are offered to the local application, and their chunks applied, via the following ABCI calls:

  • OfferSnapshot: offers a discovered snapshot to the application
  • ApplySnapshotChunk: passes a fetched snapshot chunk to the application

The application can respond to an offered snapshot by accepting it, rejecting it, rejecting its format, rejecting its senders, aborting state sync, and so on.

Once a snapshot is accepted, Tendermint fetches chunks from available peers and applies them sequentially to the application. The application can respond to a chunk by accepting it, refetching it, rejecting the snapshot, rejecting the sender, aborting state sync, and so on.
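The decision logic described above might be structured as follows. The result types and constants are local stand-ins defined for this sketch, not the ABCI enum values, and the policy itself (a set of supported formats, a retry limit per chunk) is a hypothetical example of what an application could choose to do.

```go
package main

import "fmt"

// OfferResult and ChunkResult are local stand-ins for the possible answers
// an application can give; they are not the ABCI types.
type OfferResult string
type ChunkResult string

const (
	OfferAccept       OfferResult = "accept"
	OfferReject       OfferResult = "reject"
	OfferRejectFormat OfferResult = "reject_format"

	ChunkAccept         ChunkResult = "accept"
	ChunkRefetch        ChunkResult = "refetch"
	ChunkRejectSnapshot ChunkResult = "reject_snapshot"
)

// decideOffer rejects unknown formats and empty snapshots, and accepts
// everything else.
func decideOffer(format, chunks uint32, supported map[uint32]bool) OfferResult {
	if !supported[format] {
		return OfferRejectFormat
	}
	if chunks == 0 {
		return OfferReject
	}
	return OfferAccept
}

// decideChunk accepts a chunk that passed verification, asks for a refetch
// while attempts remain, and otherwise gives up on the snapshot.
func decideChunk(checksumOK bool, attempts, maxAttempts int) ChunkResult {
	switch {
	case checksumOK:
		return ChunkAccept
	case attempts < maxAttempts:
		return ChunkRefetch
	default:
		return ChunkRejectSnapshot
	}
}

func main() {
	supported := map[uint32]bool{1: true}
	fmt.Println(decideOffer(1, 3, supported)) // accept
	fmt.Println(decideOffer(2, 3, supported)) // reject_format
	fmt.Println(decideChunk(false, 1, 3))     // refetch
}
```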

After all chunks have been applied, Tendermint will invoke the Info ABCI method on the application and verify that the app hash and height correspond to the trusted values from the chain. It will then switch to fast sync to fetch any remaining blocks (if enabled), before finally joining normal consensus operation.

Restoring snapshots is the responsibility of the application and typically involves reversing the steps used to generate the snapshot. Note that Tendermint only verifies a snapshot after all chunks have been restored and does not reject any P2P peers on its own. As long as the trusted hash and the application code are correct, an adversary cannot cause a state-synced node to join consensus with incorrect state. It is, however, up to the application to counteract state sync denial-of-service attacks, for example by implementing incremental verification and rejecting invalid peers.

It is important to note that state-synced nodes will have a truncated block history starting at the height of the restored snapshot. Furthermore, there is currently no backfill of all block data. Networks should consider the broader implications of this and may want to ensure that at least a few archive nodes retain a complete block history for auditability, backup, and other purposes.

Cosmos SDK State Sync

In Cosmos SDK version 0.40 and later, application developers only need to enable state sync to benefit from it. They do not have to implement the Tendermint state sync protocol described in the section above themselves.

State Sync Snapshots

Tendermint Core manages the majority of the process of discovering, exchanging, and verifying state data for state synchronization. However, the application is responsible for taking periodic snapshots of its state and providing them to Tendermint via ABCI calls. Additionally, the application must be able to restore these snapshots when synchronizing a new node.

The Cosmos SDK stores application state in a data store called IAVL, with each module having its own IAVL stores. At regular height intervals, which are configurable, the Cosmos SDK will export the contents of each store at that height, Protobuf-encode and compress it, and save it to a snapshot store in the local filesystem. Since IAVL maintains historical versions of data, these snapshots can be generated simultaneously with new blocks being executed. These snapshots will then be retrieved by Tendermint via ABCI when a new node is state synchronizing.

It is important to note that only IAVL stores managed by the Cosmos SDK can be snapshotted. If the application stores additional data in external data stores, there is currently no mechanism to include these in state sync snapshots. As a result, the application cannot utilize automatic state synchronization via the SDK. However, the application is free to implement the state synchronization protocol itself, as described in the ABCI Documentation.

During the process of state synchronization, Tendermint will retrieve a snapshot from peers in the network and provide it to the local application, which will import it into its IAVL stores. Following this, Tendermint verifies the application's app hash against the main blockchain using light client verification and proceeds with the normal execution of blocks. It is important to note that a state-synced node will only restore the application state up to the height the snapshot was taken at, and does not contain historical data or historical blocks.

Enabling State Sync Snapshots

To enable state sync snapshots, an application using the Cosmos SDK BaseApp must set up a snapshot store (with a database and a filesystem directory), configure the snapshotting interval, and specify the number of historical snapshots to keep. A minimal example follows:

// Snapshots are stored under <home>/data/snapshots.
snapshotDir := filepath.Join(
	cast.ToString(appOpts.Get(flags.FlagHome)), "data", "snapshots")
// Metadata database used by the snapshot store.
snapshotDB, err := sdk.NewLevelDB("metadata", snapshotDir)
if err != nil {
	panic(err)
}
snapshotStore, err := snapshots.NewStore(snapshotDB, snapshotDir)
if err != nil {
	panic(err)
}
app := baseapp.NewBaseApp(
	"app", logger, db, txDecoder,
	baseapp.SetSnapshotStore(snapshotStore),
	// Interval (in blocks) and retention are read from the state-sync flags.
	baseapp.SetSnapshotInterval(cast.ToUint64(appOpts.Get(
		server.FlagStateSyncSnapshotInterval))),
	baseapp.SetSnapshotKeepRecent(cast.ToUint32(appOpts.Get(
		server.FlagStateSyncSnapshotKeepRecent))),
)

When started with the appropriate flags, such as --state-sync.snapshot-interval 1000 --state-sync.snapshot-keep-recent 2, the application should generate snapshots and log messages like these:

Creating state snapshot    module=main height=3000
Completed state snapshot module=main height=3000 format=1

Note that the snapshot interval must currently be a multiple of pruning-keep-every (which defaults to 100), so that the heights being snapshotted are not pruned while the snapshot is taken. It is also generally a good idea to keep at least two recent snapshots, so that the previous snapshot is not removed while a node is still using it to state sync.
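The multiple-of rule is simple arithmetic and can be checked before a node is started. The helper below is a hypothetical pre-flight check written for this guide, not a function from the Cosmos SDK; treating a value of 0 as "disabled" is an assumption of the sketch.

```go
package main

import "fmt"

// checkSnapshotInterval returns an error when the snapshot interval is not
// a multiple of pruning-keep-every.
func checkSnapshotInterval(snapshotInterval, pruningKeepEvery uint64) error {
	// Assumption: 0 means the feature is disabled, so there is nothing to check.
	if snapshotInterval == 0 || pruningKeepEvery == 0 {
		return nil
	}
	if snapshotInterval%pruningKeepEvery != 0 {
		return fmt.Errorf("snapshot interval %d is not a multiple of pruning-keep-every %d",
			snapshotInterval, pruningKeepEvery)
	}
	return nil
}

func main() {
	fmt.Println(checkSnapshotInterval(1000, 100)) // 1000 = 10 * 100, so no error
	fmt.Println(checkSnapshotInterval(1050, 100)) // 1050 % 100 = 50, so an error
}
```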

State Syncing a Node

tip

Are you looking for nodes to sync with for snapshots or archives? Have a look at this page.

Once a few nodes in a network have taken state synchronization snapshots, new nodes can join the network using state synchronization. To accomplish this, the node should first be configured normally, and the following pieces of information must be obtained for light client verification:

  • At least two available RPC servers
  • A trusted height
  • The hash of the block ID at the trusted height

The trusted hash must come from a trusted source, such as a block explorer. However, the RPC servers do not need to be trusted. Tendermint will use the hash to obtain trusted app hashes from the blockchain in order to verify restored application snapshots. The app hash and corresponding height are the only pieces of information that can be trusted when restoring snapshots. Adversaries can forge everything else.

In this guide, we will use Ubuntu 20.04 as the operating system.

Prepare system

Update system

sudo apt update -y

Upgrade system

sudo apt upgrade -y

Install dependencies

sudo apt-get install ca-certificates curl gnupg lsb-release make gcc git jq wget -y

Install Go

wget -q -O - https://raw.githubusercontent.com/canha/golang-tools-install-script/master/goinstall.sh | bash
source ~/.bashrc

Set the node name

moniker="NODE_NAME"

Use commands below for Testnet setup

SNAP_RPC1="https://rpc.humans.nodestake.top"
SNAP_RPC="https://rpc-humansai.thenop.io:443"
CHAIN_ID="humans_4139-1"
PEER="5e51671241340f1d1e1409a9e0cc4474820bf782@humans-mainnet-peer.itrocket.net:17656"
wget -O $HOME/genesis.json https://raw.githubusercontent.com/humansdotai/testnets/main/genesis_4139-1.json

Use commands below for Mainnet setup

SNAP_RPC1="https://rpc.humans.nodestake.top"
SNAP_RPC="https://rpc.nodejumper.io/humans"
CHAIN_ID="humans_1089-1"
PEER="5e51671241340f1d1e1409a9e0cc4474820bf782@humans-mainnet-peer.itrocket.net:17656,2f8a0bf63e23606dc85bdd11afbf34e68a9f3b74@mainnet-humans.konsortech.xyz:40656"
wget -O $HOME/genesis.json https://raw.githubusercontent.com/humansdotai/mainnet/main/mainnet/1/genesis_1089-1.json

Install humansd

git clone https://github.com/humansdotai/humans.git && \
cd humans && \
make install

Configuration

Node init

humansd init "$moniker" --chain-id "$CHAIN_ID"

Move genesis file to .humansd/config folder

mv $HOME/genesis.json ~/.humansd/config/

Reset the node

humansd tendermint unsafe-reset-all --home $HOME/.humansd

Change config files (set the node name, add persistent peers, set indexer = "null")

sed -i -e "s%^moniker *=.*%moniker = \"$moniker\"%; " $HOME/.humansd/config/config.toml
sed -i -e "s%^indexer *=.*%indexer = \"null\"%; " $HOME/.humansd/config/config.toml
sed -i -e "s%^persistent_peers *=.*%persistent_peers = \"$PEER\"%; " $HOME/.humansd/config/config.toml

Set the variables for starting from a snapshot

note

Note: On other Cosmos chains, the user is typically told to use the most recent height minus 2000 as their trusted height. The Humans.ai chain takes a while to generate a snapshot, therefore the difference between the snapshot height and the most recent height is usually greater than 2000. Consequently, a context timeout error occurs:

5:33PM ERR error on light block request from witness, removing... error="post failed: Post \"http://humans-mainnet-state-sync-eu-01.net:26657\": context deadline exceeded" module=light primary={} server=node

To avoid this issue, pick a trusted height close to the snapshot height. For example, if you know there is a snapshot at height 13286000, use the hash of block 13284000 as the trust hash.

You can get the latest snapshot height from here. Otherwise, your node will find the available snapshots. You will see logs similar to this:

5:48PM INF Discovered new snapshot format=2 hash="�z���Մ��^��Q\x1a\\I_�\x0f�OT!�(jM�$!��" height=13286000 module=statesync server=node

Fetch the latest height, choose the trust height, and fetch its block hash

LATEST_HEIGHT=$(curl -s "$SNAP_RPC/block" | jq -r .result.block.header.height); \
BLOCK_HEIGHT=$((LATEST_HEIGHT - 40000)); \
TRUST_HASH=$(curl -s "$SNAP_RPC/block?height=$BLOCK_HEIGHT" | jq -r .result.block_id.hash)

Check

echo $LATEST_HEIGHT $BLOCK_HEIGHT $TRUST_HASH

Example output (your values will differ; this sample was produced with an offset of 2000 rather than 40000):

376080 374080 F0C78FD4AE4DB5E76A298206AE3C602FF30668C521D753BB7C435771AEA47189
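For readers who prefer to script this step in Go rather than with curl and jq, the sketch below extracts the same two fields from a /block response and derives a trust height. It works on a hard-coded sample body instead of calling an RPC server; the struct models only the fields the jq filters above read, and the assumption that height is encoded as a JSON string should be checked against your RPC endpoint. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// blockResponse models only the fields read by the jq filters above.
type blockResponse struct {
	Result struct {
		BlockID struct {
			Hash string `json:"hash"`
		} `json:"block_id"`
		Block struct {
			Header struct {
				Height string `json:"height"` // assumed to be a JSON string
			} `json:"header"`
		} `json:"block"`
	} `json:"result"`
}

// parseBlock extracts the height and block ID hash from a /block response body.
func parseBlock(body []byte) (int64, string, error) {
	var r blockResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, "", err
	}
	h, err := strconv.ParseInt(r.Result.Block.Header.Height, 10, 64)
	if err != nil {
		return 0, "", fmt.Errorf("invalid height %q: %w", r.Result.Block.Header.Height, err)
	}
	return h, r.Result.BlockID.Hash, nil
}

// trustHeight subtracts offset from latest, never going below height 1.
func trustHeight(latest, offset int64) int64 {
	if latest-offset < 1 {
		return 1
	}
	return latest - offset
}

func main() {
	// Shortened sample body; a real response contains many more fields.
	sample := []byte(`{"result":{"block_id":{"hash":"F0C7"},"block":{"header":{"height":"376080"}}}}`)
	latest, hash, err := parseBlock(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(latest, trustHeight(latest, 2000), hash) // 376080 374080 F0C7
}
```

In a real script, the hash would be read from a second request for the block at the trust height, as the shell commands above do, not from the latest block.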

If the output looks correct, apply the state sync settings

sed -i.bak -E "s|^(enable[[:space:]]+=[[:space:]]+).*$|\1true| ; \
s|^(rpc_servers[[:space:]]+=[[:space:]]+).*$|\1\"$SNAP_RPC,$SNAP_RPC1\"| ; \
s|^(trust_height[[:space:]]+=[[:space:]]+).*$|\1$BLOCK_HEIGHT| ; \
s|^(trust_hash[[:space:]]+=[[:space:]]+).*$|\1\"$TRUST_HASH\"| ; \
s|^(seeds[[:space:]]+=[[:space:]]+).*$|\1\"\"|" ~/.humansd/config/config.toml

Create humansd service

echo "[Unit]
Description=Humansd Node
After=network.target
#
[Service]
User=$USER
Type=simple
ExecStart=$(which humansd) start
Restart=on-failure
LimitNOFILE=65535
#
[Install]
WantedBy=multi-user.target" > $HOME/humansd.service; sudo mv $HOME/humansd.service /etc/systemd/system/
sudo systemctl daemon-reload && sudo systemctl enable humansd.service

Run humansd

sudo systemctl start humansd

Check logs

journalctl -u humansd -f

When the node is started it will then attempt to find a state sync snapshot in the network, and restore it:

Started node                   module=main nodeInfo="..."
Discovering snapshots for 20s
Discovered new snapshot height=3000 format=1 hash=0F14A473
Discovered new snapshot height=2000 format=1 hash=C6209AF7
Offering snapshot to ABCI app height=3000 format=1 hash=0F14A473
Snapshot accepted, restoring height=3000 format=1 hash=0F14A473
Fetching snapshot chunk height=3000 format=1 chunk=0 total=3
Fetching snapshot chunk height=3000 format=1 chunk=1 total=3
Fetching snapshot chunk height=3000 format=1 chunk=2 total=3
Applied snapshot chunk height=3000 format=1 chunk=0 total=3
Applied snapshot chunk height=3000 format=1 chunk=1 total=3
Applied snapshot chunk height=3000 format=1 chunk=2 total=3
Verified ABCI app height=3000 appHash=F7D66BC9
Snapshot restored height=3000 format=1 hash=0F14A473
Executed block height=3001 validTxs=16 invalidTxs=0
Committed state height=3001 txs=16 appHash=0FDBB0D5F
Executed block height=3002 validTxs=25 invalidTxs=0
Committed state height=3002 txs=25 appHash=40D12E4B3

The node is now state synced and has joined the network.

Once the node is fully synced, run this command to turn state sync off and avoid problems on future restarts:

sed -i.bak -E "s|^(enable[[:space:]]+=[[:space:]]+).*$|\1false|" $HOME/.humansd/config/config.toml

note

Note: The information in this document is sourced from Erik Grinaker, specifically his state sync guides for Tendermint Core and the Cosmos SDK.