# Overview
SlateDB is built as a log-structured merge-tree (LSM tree). To understand SlateDB's design, we must first understand how LSM trees work. On this page, we'll explore LSM trees and see how SlateDB uses them.
## What is an LSM tree?
LSM trees are an in-memory and on-disk data structure used to store key-value (KV) data. KV storage engines such as RocksDB, LevelDB, and Sled are based on LSM trees, and databases such as ClickHouse use similar structures internally. LSM trees provide operations common to most KV storage engines:
- put(key, value): Insert a key-value pair.
- get(key): Retrieve a key-value pair.
- delete(key): Delete a key-value pair.
- scan(range): Scan a range of keys.
- flush(): Flush the in-memory data to disk.
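A minimal sketch of this interface as a Rust trait. The names and signatures here are illustrative, not SlateDB's actual API:

```rust
/// A hypothetical key-value storage engine interface mirroring the
/// operations above. Names and signatures are illustrative only.
pub trait KvStore {
    /// Insert (or overwrite) a key-value pair.
    fn put(&mut self, key: &[u8], value: &[u8]);

    /// Retrieve the value for a key, if it exists.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;

    /// Delete a key-value pair.
    fn delete(&mut self, key: &[u8]);

    /// Return all key-value pairs whose keys fall within the given range.
    fn scan(&self, range: std::ops::Range<Vec<u8>>) -> Vec<(Vec<u8>, Vec<u8>)>;

    /// Persist in-memory data to durable storage.
    fn flush(&mut self);
}
```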
## LSM tree design
LSM trees are made up of a write-ahead log (WAL), a mutable in-memory map called a MemTable, sorted strings tables (SSTables), and a manifest. We will discuss each of these components in detail.
### Write-ahead log (WAL)
A put() call writes its data to a write-ahead log (WAL). A WAL is an append-only persistent log. WALs are used to recover data in the event of a crash. When the database restarts, it can replay the WAL to recover its state. WALs are not great for serving reads, though; they're not optimized for random access. Every write is simply appended to the end of the log. A get() operation must scan the WAL from the last write to the first write until it finds the key-value pair (or until it reaches the beginning of the WAL).
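To make this concrete, here is a toy append-only WAL in Rust. It assumes a simple length-prefixed record format and a local file, purely for illustration; it is not SlateDB's WAL format:

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, BufReader, Read, Write};

/// An append-only write-ahead log with a toy record format:
/// [key_len: u32][val_len: u32][key][value]. Illustrative only.
struct Wal {
    file: File,
}

impl Wal {
    fn open(path: &str) -> io::Result<Wal> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Wal { file })
    }

    /// Every write is simply appended to the end of the log.
    fn append(&mut self, key: &[u8], value: &[u8]) -> io::Result<()> {
        self.file.write_all(&(key.len() as u32).to_le_bytes())?;
        self.file.write_all(&(value.len() as u32).to_le_bytes())?;
        self.file.write_all(key)?;
        self.file.write_all(value)?;
        self.file.sync_all() // make the write durable before acknowledging it
    }

    /// Replay the whole log from the beginning, e.g. to rebuild in-memory
    /// state after a crash.
    fn replay(path: &str) -> io::Result<Vec<(Vec<u8>, Vec<u8>)>> {
        let mut reader = BufReader::new(File::open(path)?);
        let mut records = Vec::new();
        let mut len_buf = [0u8; 4];
        loop {
            if reader.read_exact(&mut len_buf).is_err() {
                break; // reached the end of the log
            }
            let key_len = u32::from_le_bytes(len_buf) as usize;
            reader.read_exact(&mut len_buf)?;
            let val_len = u32::from_le_bytes(len_buf) as usize;
            let mut key = vec![0u8; key_len];
            let mut value = vec![0u8; val_len];
            reader.read_exact(&mut key)?;
            reader.read_exact(&mut value)?;
            records.push((key, value));
        }
        Ok(records)
    }
}
```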
### MemTables
LSM trees avoid this linear scan by storing their data in an additional data structure called a MemTable. After a put() call writes its data to the WAL, it inserts its data into a mutable in-memory map called a MemTable. MemTables typically use a sorted map, such as a skip list, so that scan() operations are efficient as well.
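Here is a toy MemTable using Rust's standard BTreeMap as the sorted map; real engines typically use a concurrent skip list so readers and writers don't block each other. This is a sketch, not SlateDB's implementation:

```rust
use std::collections::BTreeMap;

/// A toy MemTable backed by a sorted in-memory map.
struct MemTable {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl MemTable {
    fn new() -> Self {
        MemTable { map: BTreeMap::new() }
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }

    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.map.get(key)
    }

    /// Because the map is sorted by key, a range scan is just an ordered
    /// traversal of a sub-range of the map.
    fn scan<'a>(
        &'a self,
        range: std::ops::Range<Vec<u8>>,
    ) -> impl Iterator<Item = (&'a Vec<u8>, &'a Vec<u8>)> + 'a {
        self.map.range(range)
    }
}
```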
```mermaid
sequenceDiagram
    participant C as Client
    participant WAL as Write-Ahead Log (WAL)
    participant MT as MemTable (mutable)
    C->>WAL: (1) put()/delete()
    WAL-->>C: append OK
    C->>MT: (2) put()/delete()
    MT-->>C: update OK
```
When a get() occurs, the database will search the MemTable for the key-value pair.
```mermaid
sequenceDiagram
    participant C as Client
    participant MT as MemTable (mutable)
    C->>MT: get()
    MT-->>C: key-value pair
```
### SSTables
But an in-memory MemTable only allows us to have a database that fits in memory. We must persist our data on disk. We've already seen that the WAL is an inappropriate format for serving reads. LSM trees introduce a third data structure here: the sorted strings table (SSTable). SSTables are files that contain a sequence of key-value pairs sorted by key:
```
key1:value1
key2:value2
key3:value3
...
```
We’ve established that MemTables are sorted by key, so LSM trees need only write their MemTable to disk in the same order.
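A sketch of that flush, reusing the toy record format from the WAL example above. Real SSTables also include block indexes, bloom filters, and checksums; this is illustration only:

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{self, BufWriter, Write};

/// Write a sorted map to disk as a (toy) SSTable. Because BTreeMap
/// iterates in key order, the file ends up sorted by key with no extra
/// sorting step.
fn write_sstable(path: &str, memtable: &BTreeMap<Vec<u8>, Vec<u8>>) -> io::Result<()> {
    let mut writer = BufWriter::new(File::create(path)?);
    for (key, value) in memtable {
        writer.write_all(&(key.len() as u32).to_le_bytes())?;
        writer.write_all(&(value.len() as u32).to_le_bytes())?;
        writer.write_all(key)?;
        writer.write_all(value)?;
    }
    writer.flush()?;
    writer.get_ref().sync_all() // durability before the SSTable is referenced
}
```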
```mermaid
sequenceDiagram
    box Memory
        participant C as Client
        participant MT as MemTable (mutable)
    end
    box Disk
        participant SST as SSTable
    end
    C->>MT: put()
    MT-->>C: update OK
    MT->>SST: write
    SST-->>C: write OK
```
There's a problem with this design, though. The MemTable must be locked while it is written to disk, so put() calls block until the write completes. This can cause performance issues if the MemTable is large. To get around this, LSM trees keep both a mutable MemTable and one or more immutable MemTables that are being written to disk. When the mutable MemTable reaches a certain size, it is frozen and written to disk in the background. A new mutable MemTable is then created (or pulled from a pool) to receive new writes while the frozen MemTable is written in the background.
```mermaid
sequenceDiagram
    box Memory
        participant C as Client
        participant MT as MemTable (mutable)
        participant IMT as MemTable (immutable)
    end
    box Disk
        participant SST as SSTable
    end
    C->>MT: put()
    MT-->>C: update OK
    MT->>IMT: freeze
    IMT->>SST: write
```
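A sketch of the freeze-and-swap step. The field names are illustrative, and the WAL append and the background flush task are omitted:

```rust
use std::collections::{BTreeMap, VecDeque};
use std::mem;

/// Illustrative write path state: one mutable MemTable plus a queue of
/// frozen (immutable) MemTables awaiting a background flush.
struct WritePath {
    mutable: BTreeMap<Vec<u8>, Vec<u8>>,
    immutable: VecDeque<BTreeMap<Vec<u8>, Vec<u8>>>,
    memtable_size_bytes: usize,
    max_memtable_bytes: usize,
}

impl WritePath {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        // (WAL append omitted here for brevity.)
        self.memtable_size_bytes += key.len() + value.len();
        self.mutable.insert(key.to_vec(), value.to_vec());

        if self.memtable_size_bytes >= self.max_memtable_bytes {
            self.freeze();
        }
    }

    /// Swap in a fresh mutable MemTable; the old one becomes immutable and
    /// is flushed to an SSTable by a background task, so puts never block
    /// on disk I/O.
    fn freeze(&mut self) {
        let frozen = mem::take(&mut self.mutable);
        self.immutable.push_back(frozen);
        self.memtable_size_bytes = 0;
        // A background task would pop from `self.immutable`, write each
        // frozen table as an SSTable, and then drop it.
    }
}
```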
When a get() occurs, the database will search all of the MemTables and SSTables for the key-value pair.
```mermaid
sequenceDiagram
    box Memory
        participant C as Client
        participant MT as MemTable (mutable)
        participant IMT as MemTable (immutable)
    end
    box Disk
        participant SST as SSTable
    end
    C->>SST: get()
    MT-->>C: key-value pair (if found)
    IMT-->>C: key-value pair (if found)
    SST-->>C: key-value pair (if found)
```
Imagine we have thousands (or even millions) of immutable MemTables and SSTables. A read would need to search them sequentially to find the most recent version of a key-value pair (a user might have updated a key multiple times). This is clearly not efficient. LSM trees solve this problem with compaction: a process where the database merges multiple SSTables into a new sequence of range-partitioned SSTables. Consider the following SSTables:
```
SSTable 1: key1:value1a  key56:value56a  key99:value99
SSTable 2: key43:value43  key56:value56b  key89:value89
SSTable 3: key1:value1b   key44:value44   key77:value77
```
Each SSTable in the example above is unpartitioned: it can potentially contain any key-value pair. Compaction merges these SSTables into a new sequence of range-partitioned SSTables. For example, the SSTables above might be compacted into the following SSTables:
```
SSTable 1: key1:value1b  key43:value43  key44:value44  key56:value56b  key77:value77
SSTable 2: key89:value89  key99:value99
```
Now a read for key56 will only need to search the first SSTable. Readers need only check the key range of each SSTable (stored in memory) to determine whether it might contain the key. The unpartitioned tables are called level 0 (L0) SSTables. The partitioned tables are called sorted runs (SRs). L0 SSTables are part of the write path, as we saw above. Sorted runs are generated by a background compaction process, sketched after the note below.
::: note
Sorted runs are a logical construct. The files written to disk are the sorted run’s SSTables. The manifest (which we’ll see in a moment) contains the mapping from a sorted run ID to its SSTables.
:::
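The heart of compaction is a merge of sorted inputs in which the newest value for each key wins, followed by splitting the merged output into range partitions. A toy version follows; real compactors stream a k-way merge rather than materializing everything in memory, and they also handle deletion tombstones:

```rust
use std::collections::BTreeMap;

type Record = (Vec<u8>, Vec<u8>);

/// Merge several SSTables (given oldest-to-newest) into a sequence of
/// range-partitioned output SSTables.
fn compact(inputs: Vec<Vec<Record>>, max_records_per_output: usize) -> Vec<Vec<Record>> {
    // Later (newer) inserts overwrite earlier ones, so only the most
    // recent value for each key survives the merge.
    let mut merged = BTreeMap::new();
    for sstable in inputs {
        for (key, value) in sstable {
            merged.insert(key, value);
        }
    }

    // Split the merged, sorted key space into range-partitioned SSTables.
    let mut outputs = Vec::new();
    let mut current = Vec::new();
    for record in merged {
        current.push(record);
        if current.len() >= max_records_per_output {
            outputs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        outputs.push(current);
    }
    outputs
}
```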
A get() will traverse the MemTables, L0 SSTables, and SRs to find the key-value pair.
```mermaid
sequenceDiagram
    box Memory
        participant C as Client
        participant MT as MemTable (mutable)
        participant IMT as MemTable (immutable)
    end
    box Disk
        participant L0 as L0 SSTable
        participant SR as SR (sorted run)
    end
    C->>SR: get()
    MT-->>C: key-value pair (if found)
    IMT-->>C: key-value pair (if found)
    L0-->>C: key-value pair (if found)
    SR-->>C: key-value pair (if found)
```
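A sketch of that read order. For simplicity the SSTables and sorted runs are modeled here as in-memory maps; the point is the search order and the key-range check, not the storage format:

```rust
use std::collections::BTreeMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// An illustrative read path: newest data first, oldest data last. The
/// first hit wins because it is the most recent version of the key.
struct LsmReader {
    mutable: BTreeMap<Key, Value>,
    immutable: Vec<BTreeMap<Key, Value>>,   // newest first
    l0_sstables: Vec<BTreeMap<Key, Value>>, // newest first, unpartitioned
    // Each sorted run is a list of (min_key, max_key, sstable) entries,
    // range-partitioned and non-overlapping within the run.
    sorted_runs: Vec<Vec<(Key, Key, BTreeMap<Key, Value>)>>,
}

impl LsmReader {
    fn get(&self, key: &[u8]) -> Option<&Value> {
        if let Some(v) = self.mutable.get(key) {
            return Some(v);
        }
        for memtable in &self.immutable {
            if let Some(v) = memtable.get(key) {
                return Some(v);
            }
        }
        for sstable in &self.l0_sstables {
            if let Some(v) = sstable.get(key) {
                return Some(v);
            }
        }
        for run in &self.sorted_runs {
            // Within a sorted run, key ranges don't overlap, so at most one
            // SSTable can contain the key; check the in-memory ranges first.
            for (min, max, sstable) in run {
                if key >= min.as_slice() && key <= max.as_slice() {
                    return sstable.get(key);
                }
            }
        }
        None
    }
}
```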
### Manifests
Another issue remains: upon restart, the database has no easy way to know which SSTables it should load. We might try to list all SSTs and read each file's metadata, but this doesn't work. RocksDB's MANIFEST page explains why:
> File system operations are not atomic, and are susceptible to inconsistencies in the event of system failure. Even with journaling turned on, file systems do not guarantee consistency on unclean restart. POSIX file system does not support atomic batching of operations either. Hence, it is not possible to rely on metadata embedded in RocksDB datastore files to reconstruct the last consistent state of the RocksDB on restart.
LSM tree implementations store their state in a manifest. The manifest is a file (or log of files) that stores the database’s state; this includes the location of all SSTables, the WAL recovery starting point, and many other details. The LSM tree reads the manifest on startup to reconstruct its in-memory state. As state changes, the LSM tree writes manifest changes atomically to ensure consistency.
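Here is a simplified picture of what a manifest might record, plus an atomic replace via the classic write-temp-then-rename pattern on a local filesystem. The field names are illustrative, not SlateDB's manifest schema, and object stores need their own atomic or conditional write primitives instead of rename:

```rust
use std::fs::{self, File};
use std::io::{self, Write};

/// A simplified, illustrative manifest.
struct Manifest {
    /// WAL position to start replaying from on recovery.
    wal_recovery_start: u64,
    /// Unpartitioned L0 SSTables, newest first.
    l0_sstables: Vec<String>,
    /// Sorted run ID mapped to the SSTable files that make up that run.
    sorted_runs: Vec<(u64, Vec<String>)>,
}

/// Atomically replace the manifest on a local filesystem by writing a
/// temporary file and renaming it over the old one.
fn write_manifest_atomically(path: &str, serialized: &[u8]) -> io::Result<()> {
    let tmp_path = format!("{path}.tmp");
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(serialized)?;
    tmp.sync_all()?;
    fs::rename(&tmp_path, path) // rename is atomic on POSIX filesystems
}
```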
We’ve now built a basic LSM tree. There are many, many details to be explored (deletion tombstones, consistency, WAL replay, and more). Still, this basic implementation will work. If you’re interested in experimenting more with LSMs, we highly recommend checking out Mini-LSM.
## Tradeoffs
LSM trees are elegant, simple data structures. However, they are not without tradeoffs. LSM trees work very well for high write-throughput workloads. They do so by moving as much work as possible out of the write path. Consequently, the design requires more background work (CPU), more files (storage space), and more variable read latency (I/O) than other data structures. These tradeoffs are commonly framed in the following terms:
- Write amplification: The ratio of bytes written to storage to bytes of live user data ingested.
- Read amplification: The ratio of bytes read from storage to bytes of result data returned.
- Space amplification: The ratio of total on-disk bytes to total live logical bytes.
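For example (with illustrative numbers): if ingesting 1 GiB of user data produces 1 GiB of WAL writes, 1 GiB of L0 flushes, and 3 GiB of compaction rewrites, the write amplification is (1 + 1 + 3) / 1 = 5×.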
LSM trees tend to have higher write throughput than B-trees, but worse read locality, more read amplification, more background compaction CPU, and higher tail latencies; space amplification depends heavily on the compaction strategy. This subject is an active area of research, and there are now many strategies to balance these tradeoffs.
## SlateDB's LSM design
SlateDB's LSM design diverges very little from the basic LSM tree design we've seen above. SlateDB has a WAL, MemTables, SSTables, sorted runs, and a manifest. The only difference is that all of SlateDB's data is written to object storage instead of a local disk or filesystem:
```mermaid
sequenceDiagram
    box Memory
        participant C as Client
        participant MT as MemTable (mutable)
        participant IMT as MemTable (immutable)
    end
    box Object Storage
        participant SST as SSTable
        participant SR as Sorted Run
    end
    C->>MT: put()
    MT-->>C: update OK
    MT-->>IMT: freeze
    IMT-->>SST: write
    SST-->>SR: compact
```
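Conceptually, only the destination of the flush changes. Here is a sketch with a hypothetical object-store trait; SlateDB itself builds on the Rust object_store crate, and the trait, function, and key layout below are purely illustrative:

```rust
/// A hypothetical, minimal object-store interface used only for
/// illustration; it is not the object_store crate's or SlateDB's API.
trait ObjectStorage {
    fn put_object(&self, key: &str, bytes: &[u8]) -> std::io::Result<()>;
}

/// Flushing an immutable MemTable now uploads an SSTable object instead
/// of writing a local file.
fn flush_to_object_storage(
    store: &dyn ObjectStorage,
    sst_id: u64,
    encoded_sstable: &[u8],
) -> std::io::Result<()> {
    // Hypothetical object key; the next page covers SlateDB's actual
    // object store directory structure.
    let key = format!("sst/{sst_id:020}.sst");
    store.put_object(&key, encoded_sstable)
}
```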
In the next section, we’ll look at SlateDB’s object store directory structure.