Rebuild a WeSQL Cluster Data Node

If a data node and its local storage (e.g., EBS volumes) are lost to hardware failure or another fault, you can rebuild the entire node from the object store and re-add it to the WeSQL cluster. The process restores the node without disrupting the rest of the cluster, so continuity and consistency are preserved.

Step 1: Retrieve the Cluster Leader Node

To manage the cluster and perform operations on the node, you need to identify the current leader node. The leader node coordinates replication and maintains cluster consistency. All major changes to the cluster, such as adding or removing nodes, need to be executed through the leader.

Use the following SQL query to get the leader node information:

SELECT CURRENT_LEADER FROM INFORMATION_SCHEMA.WESQL_CLUSTER_LOCAL;

Look for the CURRENT_LEADER field in the result, which will provide the address of the leader node. This leader node will be used for executing commands to modify the cluster's configuration.
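For scripted recovery, the same query can be run non-interactively through the mysql command-line client. The following is a minimal sketch, assuming the client is on PATH and that the caller supplies the connection options for any surviving node; the function name get_leader is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the leader address reported by
# INFORMATION_SCHEMA.WESQL_CLUSTER_LOCAL.
# All arguments are passed straight to the mysql client (assumed to be
# on PATH), for example -h, -P, and -u.
get_leader() {
  # -N drops the column header; -B prints plain, unboxed output.
  mysql "$@" -N -B \
    -e "SELECT CURRENT_LEADER FROM INFORMATION_SCHEMA.WESQL_CLUSTER_LOCAL;"
}
```

A call such as `leader=$(get_leader -h <surviving-node-host> -P <port> -u <user>)` would then hold the address for the later steps; the placeholders must be replaced with values for your own cluster.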

Step 2: Remove or Degrade the Data Node From the Cluster

Before you can remove the failed data node from the cluster, you must demote it from the follower role to the learner role. Demoting first lets the node be removed safely, without disrupting Raft consensus or replication among the remaining nodes.

Execute the following commands on the leader node to downgrade the node and then drop it as a learner:

CALL dbms_consensus.downgrade_follower('192.168.0.2:13006');
CALL dbms_consensus.drop_learner('192.168.0.2:13006');

  • downgrade_follower: Reduces the node’s role to learner. A learner does not participate in the Raft protocol but can still replicate data.
  • drop_learner: Removes the learner from the cluster entirely, making it safe to rebuild the node and re-add it later.

Note: If the node’s IP address remains unchanged during the rebuilding process, you may skip the drop_learner step, as the node's identity in the cluster configuration remains intact. In that case, you will only need to upgrade the node after starting it.
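The two removal variants (drop the learner, or keep it registered when the IP address is unchanged) can be captured in a small script. This is a sketch only: the function name removal_sql and its "keep" argument are hypothetical, and the address check accepts only simple host:port values.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that takes a failed node out of
# the cluster. Pipe the output into a mysql session on the leader.
#   $1  node address in host:port form
#   $2  "keep" to leave the node registered as a learner (IP unchanged);
#       anything else, or nothing, also drops the learner
removal_sql() {
  node=$1
  mode=${2:-drop}
  # Accept only a single host:port pair so no stray quotes reach SQL.
  case $node in
    *[!A-Za-z0-9.:-]*|*:*:*|:*|*:|"")
      echo "invalid address: $node" >&2; return 1 ;;
    *:*) ;;
    *) echo "invalid address: $node" >&2; return 1 ;;
  esac
  printf "CALL dbms_consensus.downgrade_follower('%s');\n" "$node"
  if [ "$mode" != "keep" ]; then
    printf "CALL dbms_consensus.drop_learner('%s');\n" "$node"
  fi
}
```

For example, `removal_sql '192.168.0.2:13006' keep` prints only the downgrade_follower call, matching the case described in the note.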

Step 3: Prepare the my.cnf Configuration File

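To recover the data node, recreate the configuration file so that it matches the settings used when the data node was initially set up.

Create the data directory, together with the temporary, run, and log directories that the configuration below refers to, and then edit the my.cnf file:

mkdir -p /u01/mysql_data_leader /u01/mysql_data_leader_tmp \
         /u01/mysql_data_leader_run /u01/mysql_data_leader_log
vim /u01/mysql_data_leader/my.cnf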

Add the following content:

[mysqld]
# binlog
sync_binlog=1
sync_relay_log=1
log_bin=master-bin
log_bin_index=master-bin.index

# Raft settings
raft_replication_auto_leader_transfer=ON

# serverless settings
objectstore_provider=aws
objectstore_region=us-west-1
objectstore_bucket=wesql-storage
repo_objectstore_id=sysbench
branch_objectstore_id=main

# server
port=3006
datadir=/u01/mysql_data_leader
tmpdir=/u01/mysql_data_leader_tmp
socket=/u01/mysql_data_leader_tmp/mysqld.sock
pid-file=/u01/mysql_data_leader_run/mysqld.pid
log-error=/u01/mysql_data_leader_log/mysqld.err
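The configuration above points at several directories besides datadir (for the socket, PID file, and error log), and mysqld expects them to exist. As a hedged sketch, the hypothetical helper below lists every directory a my.cnf refers to so they can be created before startup. It handles only plain key=value lines for the five path settings shown here.

```shell
#!/bin/sh
# Hypothetical helper: list the directories a my.cnf expects to exist.
# datadir and tmpdir are printed as-is; for socket, pid-file, and
# log-error the parent directory is printed.
#   $1  path to the my.cnf file
required_dirs() {
  awk -F= '
    {
      key = $1; val = $2
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
    }
    key == "datadir" || key == "tmpdir" { print val }
    key == "socket" || key == "pid-file" || key == "log-error" {
      sub("/[^/]*$", "", val)            # strip the file name
      print val
    }
  ' "$1" | LC_ALL=C sort -u
}
```

One possible use is `required_dirs /u01/mysql_data_leader/my.cnf | xargs mkdir -p`, which creates any directory that is still missing.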

Step 4: Start the Data Node as a Learner

Once the node has been removed (or demoted), start it as a learner with its storage restored from the object store. The node can then sync its data back from the cluster and participate in replication.

You need to configure the new instance of mysqld with the correct data directory and object store parameters, including region and bucket information where the snapshots are stored.

Use the following command to start the data node:

mysqld \
--defaults-file=/u01/mysql_data_leader/my.cnf \
--raft-replication-force-change-meta=ON \
--raft-replication-cluster-info='192.168.0.2:13006'

  • raft-replication-force-change-meta: Forces a metadata refresh in the Raft replication protocol, ensuring the node updates its metadata to align with the cluster.
  • raft-replication-cluster-info: Contains the necessary information for the node to join the cluster, such as the IP address and port.
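Before moving on, it can help to confirm that the new mysqld process is accepting connections. The sketch below polls with `mysqladmin ping`, which is documented to exit with status 0 once the server is running. The function name wait_for_mysqld, the retry count, and the delay are illustrative assumptions, and a successful ping shows only that the server is up, not that it has finished syncing.

```shell
#!/bin/sh
# Hypothetical helper: poll until mysqld answers "mysqladmin ping".
#   $1  socket path (the socket= value from my.cnf)
#   $2  maximum number of attempts (default 30)
#   $3  seconds to wait between attempts (default 2)
wait_for_mysqld() {
  sock=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if mysqladmin --socket="$sock" ping >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "mysqld did not respond after $tries attempts" >&2
  return 1
}
```

With the configuration from Step 3, the call would be `wait_for_mysqld /u01/mysql_data_leader_tmp/mysqld.sock`.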

Step 5: Rejoin the Data Node to the Cluster

Once the data node has started successfully and is running as a learner, it can be rejoined to the cluster. The learner role lets the node sync data from the cluster before it begins participating in the Raft protocol.

Execute the following commands on the leader node to add and upgrade the node:

CALL dbms_consensus.add_learner('192.168.0.2:13006');
CALL dbms_consensus.upgrade_learner('192.168.0.2:13006');

  • add_learner: Adds the node to the cluster as a learner, allowing it to start syncing data without participating in the Raft protocol.
  • upgrade_learner: Promotes the node from learner to follower, enabling it to fully participate in Raft and replication.
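The rejoin step also has two variants, depending on whether the node is still registered as a learner because drop_learner was skipped. A minimal sketch, with the function name rejoin_sql and its "registered" argument being hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that brings a rebuilt node back.
# Pipe the output into a mysql session on the leader.
#   $1  node address in host:port form
#   $2  "registered" if drop_learner was skipped, so the node is still
#       known to the cluster and add_learner is not needed
rejoin_sql() {
  node=$1
  if [ "${2:-}" != "registered" ]; then
    printf "CALL dbms_consensus.add_learner('%s');\n" "$node"
  fi
  printf "CALL dbms_consensus.upgrade_learner('%s');\n" "$node"
}
```

Called as `rejoin_sql '192.168.0.2:13006'`, it prints both statements in order; with `registered` as the second argument it prints only the upgrade_learner call.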

Note: If the IP address of the node remains unchanged and you skipped the drop_learner step, simply upgrade the node using the upgrade_learner command.

After completing these steps, the data node will be fully restored and reintegrated into the cluster. Its data will be recovered from the object store, and it will synchronize with the rest of the cluster.