Recovery from S3

Overall, WeSQL employs a combination of full snapshots and incremental binlogs to ensure data durability, enabling a new stateless Data Node to fully recover data from S3 without any loss.

The Data Node periodically generates snapshots and uploads them to S3, along with continuous uploads of the binlog. In the event of a Data Node failure, the system can recover by downloading the latest snapshot and replaying the binlogs from S3, ensuring that all data is restored up to the failure point.

Snapshot

The Snapshots stored in S3 allowing the system to restore all data to a specific point-in-time during initialization or in the event of a failure. The frequency of snapshot generation can be configured using the snapshot_archive_period parameter in the configuration file.

A SmartEngine snapshot includes the following components:

SmartEngine metadata: This includes the checkpoint and a few manifest files, capturing the engine's internal state, such as the structure of the LSM tree.
SmartEngine's latest WAL. Data written to SmartEngine first resides in the MemTable and may not yet be persisted to S3 as extent objects. During recovery, this data is reconstructed by replaying the WAL to restore any unflushed changes.
System tables: Schema information for databases, tables, and indexes. Since SeEngine's Data Dictionary is stored in InnoDB tables, periodic snapshots of these tables are generated using the InnoDB Clone plugin.
Binlog position: The snapshot also records the current binlog position at the time of the snapshot. During recovery, the system replays binlog entries starting from this position to bring the system up to date.

Binlog

WeSQL nodes replicate binlog entries using the Raft protocol, where the WeSQL Data Node functions as the Leader. During a write transaction, prior to committing, the Data Node sends the corresponding binlog entries to WeSQL Logger Nodes. The Logger Node appends these entries to binlog files stored on EBS and acknowledges back to the Data Node. The transaction is committed only once a majority of nodes in the Raft group have successfully persisted the binlog entries.

Binlogs are append-only. When an EBS-stored binlog file reaches a predefined size (4MB), WeSQL rotates the file and uploads the sealed binlog to S3.

To ensure timely binlog uploads, WeSQL enforces periodic binlog rotation. For example, if more than 1 second has passed, WeSQL forces the current binlog to be sealed and a new one created, even if the file size limit hasn’t been reached. The sealed binlog is then uploaded to S3.

In the event of a Data Node failure, one of the Logger Nodes is elected as the new Leader. The new Leader ensures that any unuploaded binlog entries are written to S3, maintaining complete and consistent data in S3.

OverAll Recovery Process

During Data Node recovery, if the node starts from an empty EBS (e.g., after a full disk failure or a fresh node setup), it first downloads the latest snapshot from S3. The snapshot contains the binlog offset (i.e. position of the binlog entry recorded in the snapshot). The Data Node then retrieves binlog entries from S3 starting from the recorded offset and replays them to restore the data.

However, if the Data Node is simply restarted and the EBS is still intact, recovery operates just like a local database recovery. In this case, there is no need to download the snapshot from S3. The Data Node will replay the local SmartEngine WAL logs and use the binlog to determine whether transactions were committed. If a transaction was not committed, it will be rolled back.

Snapshot​

Binlog​

OverAll Recovery Process​

Snapshot

Binlog

OverAll Recovery Process