Why S3?
The first decision to make when building a serverless MySQL that runs in the cloud is choosing the right storage solution.
Evaluating Storage Options for Serverless MySQL in the Cloud
We evaluated several options to find the best fit:
EBS
Although EBS is called "elastic," as of 2024, it doesn't fully live up to the term:
- Elasticity limitations: Cloud providers impose restrictions on EBS expansion API calls. For example, with AWS, after resizing an EBS volume, you must wait 6 hours before making another expansion request (see the sketch at the end of this subsection).
- Attachment delays: Detaching an EBS volume from one EC2 instance and attaching it to another is not covered by any SLA, and the time can vary from a few seconds to several minutes.
- No cross-AZ support: EBS volumes cannot be mounted across AZs. They can only be attached to EC2 instances within the same AZ.
Besides, EBS is expensive, and its performance is not ideal. For instance, gp3 volumes include only a 3,000 IOPS and 125 MB/s baseline for free, which can become a bottleneck during database dirty-page flushes; if you exceed this baseline, you'll incur additional charges for extra IOPS and throughput.
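To make the elasticity limitation concrete, here is a minimal sketch in Python with boto3. The volume ID is a placeholder, and the `VolumeModificationRateExceeded` error code is our assumption of what the API returns for back-to-back resizes; verify against your SDK version:

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")
VOLUME_ID = "vol-0123456789abcdef0"  # placeholder

def grow_volume(volume_id: str, new_size_gib: int) -> bool:
    """Request an EBS size increase; returns False if the volume is
    still inside the per-volume modification cooldown."""
    try:
        ec2.modify_volume(VolumeId=volume_id, Size=new_size_gib)
        return True
    except ClientError as e:
        # Assumed error code for back-to-back resize attempts.
        if e.response["Error"]["Code"] == "VolumeModificationRateExceeded":
            return False
        raise

grow_volume(VOLUME_ID, 200)  # first resize goes through
grow_volume(VOLUME_ID, 300)  # typically rejected for ~6 hours
```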
Instance store
Instance Store offers high-performance, more affordable local SSDs on EC2 instances, but comes with a major limitation due to its ephemeral nature: if the EC2 instance is stopped or terminated, all data stored locally is lost, making it unsuitable for most database use cases. However, its excellent performance makes it an ideal choice for caching.
Building a dedicated distributed storage
Building a dedicated distributed storage system (page servers) is an attractive option. In theory, it allows us to implement any desired feature or capability. For example, AWS Aurora has its own custom-built multi-tenant distributed storage system, where multiple users' databases are stored in a single Aurora storage cluster.
While distributed storage is well-suited for fully managed database services, it’s not ideal for users who prefer BYOC (deploying in their own VPC):
- Adding a distributed storage system introduces significant operational and maintenance overhead. Requiring MySQL users to set up and manage a distributed storage system is far from appealing.
- Moreover, distributed storage comes with upfront costs that aren't pay-as-you-go. Even for a small 10GB database, users are forced to deploy a multi-node storage system that can handle several terabytes, likely deterring many potential users.
After eliminating these options, few choices remain, and the most suitable storage option becomes clear: S3.
Using S3 as a Disk: Benefits and Challenges
Storing data in S3 offers several immediate advantages:
- Unlimited data capacity: S3 buckets have no storage limits.
- Extremely high data reliability: S3 guarantees 99.999999999% (11 nines) durability, eliminating the need for multiple replicas to prevent data loss.
- High throughput: S3 can handle tens of gigabits per second of data transfer, which is sufficient for database dirty page flushes and loading data during recovery.
- Region-level disaster recovery capabilities: S3 supports region-level recovery. In case of an AZ failure, you can quickly restore replicas from S3 in another AZ.
- Significantly reduced storage costs: AWS S3 Standard costs $0.023 per GB per month, while EBS gp3 costs $0.08 per GB per month. Additionally, EBS often requires 2-3 replicas, making it 7-10x more expensive than S3.
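A quick back-of-envelope check of that multiplier, using the list prices above (prices vary by region; this is just the arithmetic):

```python
S3_STANDARD = 0.023  # $/GB-month, AWS S3 Standard
EBS_GP3 = 0.08       # $/GB-month, AWS EBS gp3

for replicas in (2, 3):
    ratio = (EBS_GP3 * replicas) / S3_STANDARD
    print(f"{replicas} EBS replicas vs. S3: {ratio:.1f}x")

# 2 EBS replicas vs. S3: 7.0x
# 3 EBS replicas vs. S3: 10.4x
```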
However, two major challenges also arise immediately:
- High latency: S3 is optimized for throughput, not low-latency access. Storing and retrieving data from S3 typically incurs higher latency compared to block storage like EBS. For example, users often expect a database write operation to return within a few milliseconds, whereas S3 write latency is around hundreds of milliseconds. This represents a significant gap in performance expectations.
- No support for random writes: S3 is an object storage system, meaning it does not support random writes — data must be written as a whole object. This can be problematic for databases where frequent small updates and random I/O operations are common, as S3 requires overwriting entire objects during such operations.
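To see why this hurts, here is a sketch of what "updating" a few bytes inside an S3 object actually entails (Python with boto3; the bucket and key names are placeholders): the entire object must be read back and rewritten.

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-data-bucket", "pages/000001.ibd"  # placeholders

def patch_object(offset: int, patch: bytes) -> None:
    """S3 has no pwrite(): to change a few bytes we must
    read-modify-write the whole object."""
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    new_body = body[:offset] + patch + body[offset + len(patch):]
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=new_body)  # full rewrite

# A 16 KiB page update inside a 1 GiB object still transfers ~2 GiB.
```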
How WeSQL Addresses These Challenges
To address these challenges, we adhere to the following design principles:
Design Principle 1: Separate the concerns of Persistence and Latency
One of the key challenges in designing a cloud-native database like WeSQL is the inherent high write latency of object storage systems such as S3, which can increase the latency of write transactions to unacceptable levels — for example, even writing a single record might require waiting several hundred milliseconds. A critical design principle to mitigate this issue is to separate the concerns of persistence and latency-critical operations.
Instead of writing every operation directly to S3, the system can use local storage (e.g., EBS or EC2 instance store) for latency-sensitive operations. For example, binlog writes, which need to be quick and reliable, can be stored locally in low-latency storage. The binlog can then be asynchronously uploaded to S3 in the background, ensuring that the system benefits from both quick local write operations and long-term, cost-effective cloud storage.
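A minimal sketch of this pattern in Python. The bucket name and segment paths are placeholders, and this illustrates the local-fsync-plus-async-upload idea, not WeSQL's actual implementation:

```python
import os
import queue
import threading

import boto3

s3 = boto3.client("s3")
BUCKET = "wesql-binlog"   # placeholder bucket
upload_q = queue.Queue()  # sealed binlog segments awaiting upload

def append_binlog(f, event: bytes) -> None:
    """Latency-critical path: a local append plus fsync, so a commit
    never waits on S3."""
    f.write(event)
    f.flush()
    os.fsync(f.fileno())

def uploader() -> None:
    """Background path: ship sealed segments to S3 off the commit path."""
    while True:
        path = upload_q.get()
        s3.upload_file(path, BUCKET, os.path.basename(path))
        upload_q.task_done()

threading.Thread(target=uploader, daemon=True).start()

# On segment rotation, hand the sealed file to the uploader:
# upload_q.put("/data/binlog/binlog.000042")
```

The commit path touches only local storage; S3's latency is absorbed entirely by the background thread.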
To ensure durability and fault tolerance, the binlog can be replicated across multiple nodes using a consensus algorithm such as Raft or Paxos. This way, even if a minority of nodes experience failures, such as EBS corruption or data loss, the binlog will not be lost.
This approach allows for low-latency writes, while maintaining strong consistency and persistence guarantees, ensuring that S3 latency does not impact the critical path of write operations.
Design Principle 2: Log-structured data structure is a first-class citizen on S3
Adopting a log-structured data structure is central to optimizing performance in systems that use object storage like S3. S3 is optimized for sequential writes and bulk operations, and it performs poorly with random, small writes. A log-structured design is highly compatible with this, as it converts random write patterns into sequential writes of immutable objects, matching S3's write-once, whole-object model.
A log-structured merge-tree (LSM) is particularly well-suited to such environments (a sketch of the write path follows this list):
- In-Memory Buffering: Updates are first written to an in-memory buffer (e.g., a MemTable). This allows for very quick writes, as data is initially stored in memory.
- Batch Flushes: Once the in-memory buffer reaches a certain threshold, it is flushed in bulk to S3, ensuring that writes are sequential and batched. This minimizes the performance overhead of S3's high latency for small, random writes.
- Compaction: Over time, LSM trees perform compaction, merging older data from different levels of the tree. This compaction process also benefits from S3's sequential write performance, as large chunks of data are written in bulk.
- Efficient Reads: While S3 handles sequential writes well, reads can be optimized with multi-tier caches like an in-memory row/block cache and SSD-based block cache, minimizing the need for frequent access to S3 and improving performance.
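Here is the toy write-path sketch referenced above (Python; the JSON-based SSTable encoding and all names are simplifications, not WeSQL's on-disk format): random updates accumulate in a sorted in-memory buffer, then flush to S3 as a single sequential PUT of an immutable object.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "wesql-sst"                # placeholder bucket
FLUSH_THRESHOLD = 64 * 1024 * 1024  # flush the MemTable at ~64 MiB

class MemTable:
    """In-memory buffer: absorbs random writes at memory speed."""
    def __init__(self):
        self.data, self.size, self.seq = {}, 0, 0

    def put(self, key: str, value: bytes) -> None:
        self.size += len(key) + len(value)
        self.data[key] = value
        if self.size >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        """Turn the buffer into one sorted, immutable object and
        write it to S3 with a single sequential PUT."""
        sst = json.dumps(
            {k: self.data[k].hex() for k in sorted(self.data)}
        ).encode()
        s3.put_object(Bucket=BUCKET, Key=f"sst/{self.seq:06d}.sst", Body=sst)
        self.data, self.size = {}, 0
        self.seq += 1
        # Compaction (not shown) later merges these objects, again
        # as large sequential reads and writes.
```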
By treating LSM as a primary data structure on S3, WeSQL ensures that it can take full advantage of the cloud's scalability and cost efficiency while overcoming the performance limitations of random I/O operations.