Columnar Format Table
SmartEngine supports two data formats: row-based and column-based. The key distinction lies in how the data is internally organized within the data block.
-
Row-based format: Data is stored as individual key-value pairs, with each row represented as a key-value entry.
-
Column-based format: Data blocks are organized into several column units, where each column unit contains all the values for a particular column across multiple rows.
DDL operations
SmartEngine supports creating tables with a columnar data format through DDL statements. It also allows switching between row-based and columnar formats seamlessly.
- Create table using columnar format.
CREATE TABLE t (c1 int, c2 int) ENGINE_ATTRIBUTE='{"data_format":"columnar"}';
- Alter table to use columnar format.
ALTER TABLE t ENGINE_ATTRIBUTE='{"data_format":"columnar"}';
- Alter table to use row format.
ALTER TABLE t ENGINE_ATTRIBUTE='{"data_format":"row"}';
This flexibility is made possible by extending engine attributes. Users can explicitly define the storage format (row or columnar) in DDL statements, allowing for optimized storage based on workload requirements. For example, columnar storage is ideal for append-heavy, long-term data (such as historical tables), offering higher compression and cost savings. Conversely, row-based storage is more suited for frequently accessed and updated data, delivering better performance.
The data format can be specified at table creation or modified later using the ALTER statement. This change is applied as an instant DDL, taking effect immediately without needing to rebuild or reload existing data. The format conversion occurs incrementally during background flush or compaction processes, with minimal disruption to ongoing operations.
Automatic Row-Column Conversion
When a table is configured with a columnar format, SmartEngine automatically handles row-column conversions during read and write operations.
- Write Process: Data from the SQL layer is initially converted into SmartEngine’s row format. This row data is stored in the memtable, an in-memory structure optimized for fast writes. Upon flushing the memtable to persistent storage, SmartEngine converts the row format into columnar format.
- Read Process: SmartEngine reconstructs row data from the columnar format during reads. At the handler layer, the row format is then converted back into the SQL layer record format. This automatic format conversion enables SmartEngine to store data in a columnar format for space efficiency in persistent storage, while utilizing a row format in memory for optimal performance during processing.
Schema Propagation
In SmartEngine, under the row format, each record is stored as a key-value pair. The key represents the primary key used for data access and supports memcompare operations, enabling efficient in-memory comparisons. The value contains essential information such as null field indicators, instant DDL metadata, primary key unpacking information, and non-primary key columns.
During row-to-column format conversion in SmartEngine, each row is decomposed into individual columns, with the schema guiding how each field is processed and stored in its respective column unit. Therefore, schema information is critical for correctly interpreting and structuring the data. The schema defines how data fields should be interpreted, mapped, and transformed, ensuring accurate processing of the relevant columns and data types during the conversion.
To fully implement row-to-column conversion in the storage engine layer, SmartEngine propagates the schema information, originally defined at the SQL layer, to the storage engine layer. SmartEngine stores propagated table schemas in a data structure named SubTableMeta.