Checkpoints and Clones
Status: Accepted
Authors:
Current Implementation
SlateDB runs a Writer, a Compactor (optionally in a separate process), and a Garbage Collector (GC, also optionally in a separate process). The Writer writes data to L0 SSTs. The Compactor compacts these L0 SSTs into Sorted Runs (SRs) under the “compacted” directory, and then further compacts those SRs as they accumulate. Updates to the current set of L0s and Sorted Runs are synchronized between the Writer and Compactor through updates to the centralized manifest. When reading, the Writer gets a reference to a copy of the metadata for the current set of sorted runs and searches them for the requested data. Meanwhile, the GC scans the SSTs in Object Storage and deletes any SSTs that are not referenced by the current manifest and are older than some threshold.
Problem
The current implementation poses a few problems and limitations:
- There is nothing preventing the GC from deleting SSTs out from under the Writer while it’s attempting to read their data. This isn’t a big deal today, as we only support key lookups, which are fast. But this could become a more acute problem when we support range scans, which can be long-running.
- Currently only the Writer process can interact with the database. We’d like to support additional read-only clients, called Readers.
- Some use cases can benefit from forking the database: establishing a new db that relies on the original db’s SSTs but writes new data to its own set of SSTs, separate from the original writer’s.
In this RFC, we propose adding checkpoints and clones to address these limitations.
- Clients will be able to establish cheap db checkpoints that reference the current version of SlateDB’s manifest.
- Writers will use checkpoints to “pin” a version of the database for reads, preventing the GC from deleting the SSTs.
- Readers will use checkpoints similarly, and many Reader processes can coexist alongside a Writer.
- Finally, we will support creating so-called “clones” from a checkpoint to allow spawning new Writers that write to a “fork” of the original database.
Out Of Scope
This RFC does not propose:
- Any mechanism to provide isolation guarantees from other concurrent reads/writes to the database.
Proposal
At a high level, this proposal covers a mechanism for establishing database checkpoints, and reviews how the various processes will interact with this mechanism. Logically, a checkpoint refers to a consistent and durable view of the database at some point in time. It is consistent in the sense that data read from a checkpoint reflects all writes up to some point in time, and durable in the sense that this property holds across process restarts. Physically, a checkpoint refers to a version of SlateDB’s manifest.
Manifest Schema
Checkpoints themselves are also stored in the manifest. We’ll store them using the following schema:
```
// Reference to an external database.
table ExternalDb {
    // Path to root of the external database
    path: string (required);

    // Externally owned Checkpoint ID we've used to create an initial state of cloned database.
    source_checkpoint_id: Uuid (required);

    // Checkpoint ID this database has placed on the external database that prevents referenced
    // data files from being GC'd. Both final_checkpoint_id and source_checkpoint_id should resolve
    // to the same manifest_id as long as they both still exist.
    final_checkpoint_id: Uuid (required);

    // Compacted SST IDs belonging to external DB that are currently being referenced.
    sst_ids: [Ulid] (required);
}

// A view referencing a CompactedSsTable by its id. This indirection allows the same
// physical SST to appear multiple times in the manifest with different visible ranges
// (e.g. after projection and union), without duplicating SST metadata.
table CompactedSsTableView {
    // Reference to a CompactedSsTable by its id (stored at the top level of ManifestV2).
    id: Ulid (required);
    visible_range: BytesRange;
}

table SortedRunV2 {
    id: uint32;
    ssts: [CompactedSsTableView] (required);
}

table ManifestV2 {
    // List of external databases referenced by this manifest.
    external_dbs: [ExternalDb];

    // Flag to indicate whether initialization has finished. When creating the initial manifest for
    // a root db (one that is not a clone), this flag will be set to true. When creating the initial
    // manifest for a clone db, this flag will be set to false and then updated to true once clone
    // initialization has completed.
    initialized: boolean;

    // Optional epoch time in seconds that this database was destroyed
    destroyed_at_s: u64;

    ...

    // All compacted SSTs referenced by l0 and compacted (sorted runs) views.
    // This is the single source of truth for SST metadata (id, info, format_version).
    ssts: [CompactedSsTable] (required);

    // L0 and sorted run entries reference SSTs via CompactedSsTableView, which holds
    // only the SST id and a visible_range. This allows the same SST to be referenced
    // multiple times with different visible ranges.
    l0: [CompactedSsTableView] (required);
    compacted: [SortedRunV2] (required);

    // A list of current checkpoints that may be active (note: this replaces the existing
    // snapshots field)
    checkpoints: [Checkpoint] (required);
}

table WriterCheckpoint {
    epoch: u64;
}

union CheckpointMetadata { WriterCheckpoint }

table Uuid {
    high: uint64;
    low: uint64;
}

// Checkpoint reference to be included in manifest to record active checkpoints.
table Checkpoint {
    // Id that must be unique across all open checkpoints.
    id: Uuid (required);

    // The manifest ID that this checkpoint is using as its `CoreDbState`.
    manifest_id: ulong;

    // The UTC unix timestamp seconds that a checkpoint expires at. Clients may update this value.
    // If `checkpoint_expire_time_s` is older than now(), the checkpoint is considered expired.
    // If `checkpoint_expire_time_s` is 0, the checkpoint will never expire.
    checkpoint_expire_time_s: uint;

    // The unix timestamp seconds that a checkpoint was created.
    checkpoint_create_time_s: uint;

    // Optional metadata associated with the checkpoint. For example, the writer can use this to
    // clean up checkpoints from older writers.
    metadata: CheckpointMetadata;

    // Optional name associated with the checkpoint. The name can be used to list the checkpoints.
    // Note that name may not be unique and the list operation can return multiple checkpoints with
    // the same name.
    name: string;
}
```

Manifest V1 to V2 Migration
The introduction of ManifestV2 requires a phased rollout to ensure forward compatibility across mixed-version deployments:
- Phase 1 (current): Write V1, read V1+V2. All nodes can read the new V2 format, but manifests are still written in V1 format. This allows older nodes that only understand V1 to continue operating during a rolling upgrade. To support this, a `last_compacted_l0_sst_id` field is maintained alongside `last_compacted_l0_sst_view_id`, since V1 encodes `l0_last_compacted` as an SST ID rather than a view ID.
- Phase 2: Write V2, read V1+V2. Once all nodes are upgraded and can read V2, the writer switches to emitting V2 manifests. V1 reading is retained for backward compatibility with existing manifests on disk.
- Phase 3: Deprecate V1. Once no V1 manifests remain on disk (i.e., all have been superseded), V1 reading support can be removed.
Public API
We’ll make the following changes to the public API to support creating and using checkpoints:
```rust
/// Defines the scope targeted by a given checkpoint. If set to All, then the checkpoint will include
/// all writes that were issued at the time that create_checkpoint is called. SlateDB will flush WALs
/// (if enabled) and do a best-effort flush of memtables to L0 SSTs in order to optimize for
/// readers. The memtable flush will not await if blocked by backpressure. If set to Durable, then
/// the checkpoint includes only writes that were durable at the time of the call. This will be
/// faster, but may not include data from recent writes.
enum CheckpointScope {
    All,
    Durable,
}

/// Specify options to provide when creating a checkpoint.
struct CheckpointOptions {
    /// Optionally specifies the lifetime of the checkpoint to create. The expire time will be set
    /// to the current wallclock time plus the specified lifetime. If lifetime is None, then the
    /// checkpoint is created without an expiry time.
    lifetime: Option<Duration>,

    /// Optionally specifies an existing checkpoint to use as the source for this checkpoint. This
    /// is useful for users to establish checkpoints from existing checkpoints, but with a different
    /// lifecycle and/or metadata.
    source: Option<Uuid>,

    /// Optionally specifies a name for the checkpoint. Can be used to list the checkpoints.
    pub name: Option<String>,
}

#[derive(Debug)]
pub struct CheckpointCreateResult {
    /// The id of the created checkpoint.
    id: uuid::Uuid,
    /// The manifest id referenced by the created checkpoint.
    manifest_id: u64,
}
```
```rust
impl Db {
    /// Creates a checkpoint of an opened db using the provided options. Returns the ID of the
    /// created checkpoint and the id of the referenced manifest.
    pub async fn create_checkpoint(
        &self,
        scope: CheckpointScope,
        options: &CheckpointOptions,
    ) -> Result<CheckpointCreateResult, SlateDBError> { … }

    /// Refresh the lifetime of an existing checkpoint. Takes the id of an existing checkpoint
    /// and a lifetime, and sets the lifetime of the checkpoint to the specified lifetime. If
    /// there is no checkpoint with the specified id, then this fn fails with
    /// SlateDBError::InvalidDbState
    pub async fn refresh_checkpoint(
        path: &Path,
        object_store: Arc<dyn ObjectStore>,
        id: Uuid,
        lifetime: Option<Duration>,
    ) -> Result<(), SlateDBError> { … }

    /// Deletes the checkpoint with the specified id.
    pub async fn delete_checkpoint(
        path: &Path,
        object_store: Arc<dyn ObjectStore>,
        id: Uuid,
    ) -> Result<(), SlateDBError> { … }

    /// Called to destroy the database at the given path. If `soft` is true, this method will
    /// set the destroyed_at_s field in the manifest. The GC will clean up the db after some
    /// time has passed, and all checkpoints have either been deleted or expired. As part of
    /// cleaning up the db, the GC will also remove the database’s checkpoint from the parent
    /// database. If `soft` is false, then all cleanup will be performed by the call to this
    /// method. If `soft` is false, the destroy will return SlateDbError::InvalidDeletion if
    /// there are any remaining non-expired checkpoints.
    pub async fn destroy(
        path: Path,
        object_store: Arc<dyn ObjectStore>,
        soft: bool,
    ) -> Result<(), SlateDbError> { … }
}
```
```rust
mod admin {
    /// Creates a checkpoint of the db stored in the object store at the specified path using the
    /// provided options. Note that the scope option does not impact the behaviour of this method.
    /// The checkpoint will reference the current active manifest of the db.
    pub async fn create_checkpoint<P: Into<Path>>(
        path: P,
        object_store: Arc<dyn ObjectStore>,
        options: &CheckpointOptions,
    ) -> Result<CheckpointCreateResult, SlateDBError> { … }

    /// Clone a Db from a checkpoint. If no db already exists at the specified path, then this will
    /// create a new db under the path that is a clone of the db at parent_path. A clone is a
    /// shallow copy of the parent database - it starts with a manifest that references the same
    /// SSTs, but doesn't actually copy those SSTs, except for the WAL. New writes will be written
    /// to the newly created db and will not be reflected in the parent database. The clone can
    /// optionally be created from an existing checkpoint. If parent_checkpoint is provided, then
    /// the manifest referenced by parent_checkpoint is used as the base for the clone db's
    /// manifest. Otherwise, this method creates a new checkpoint for the current version of the
    /// parent db.
    pub async fn create_clone<P: Into<Path>>(
        clone_path: P,
        parent_path: P,
        object_store: Arc<dyn ObjectStore>,
        parent_checkpoint: Option<Uuid>,
    ) -> Result<(), Box<dyn Error>> { … }
}
```
```rust
/// Configuration options for the database reader. These options are set on client startup.
#[derive(Clone)]
pub struct DbReaderOptions {
    /// How frequently to poll for new manifest files. Refreshing the manifest file allows readers
    /// to detect newly compacted data.
    pub manifest_poll_interval: Duration,

    /// For readers that refresh their checkpoint, this specifies the lifetime to use for the
    /// created checkpoint. The checkpoint’s expire time will be set to the current time plus
    /// this value. If not specified, then the checkpoint will be created with no expiry, and
    /// must be manually removed. This lifetime must always be greater than
    /// manifest_poll_interval x 2.
    pub checkpoint_lifetime: Option<Duration>,

    /// The max size of a single in-memory table used to buffer WAL entries.
    /// Defaults to 64MB.
    pub max_memtable_bytes: u64,
}

pub struct DbReader { … }

impl DbReader {
    /// Creates a database reader that can read the contents of a database (but cannot write any
    /// data). The caller can provide an optional checkpoint. If the checkpoint is provided, the
    /// reader will read using the specified checkpoint and will not periodically refresh the
    /// checkpoint. Otherwise, the reader creates a new checkpoint pointing to the current manifest
    /// and refreshes it periodically as specified in the options. It also removes the previous
    /// checkpoint once any ongoing reads have completed.
    pub async fn open(
        path: Path,
        object_store: Arc<dyn ObjectStore>,
        checkpoint: Option<Uuid>,
        options: DbReaderOptions,
    ) -> Result<Self, SlateDBError> { … }

    /// Read a key and return the read value, if any.
    pub async fn get(&self, key: &[u8]) -> Result<Option<Bytes>, SlateDBError> { … }
}
```
```rust
pub struct GarbageCollectorOptions {
    /// Defines a grace period for deletion of a db. A db for which deletion is handled by the GC
    /// will not be deleted from object storage until this duration has passed after its deletion
    /// timestamp.
    pub db_delete_grace: Duration,
}
```

Finally, we’ll also allow users to create/refresh/delete/list checkpoints using the admin CLI:
```
$ slatedb create-checkpoint --help
Create a new checkpoint pointing to the database's current state

Usage: slatedb --path <PATH> create-checkpoint [OPTIONS]

Options:
  -l, --lifetime <LIFETIME>  Optionally specify a lifetime for the created checkpoint. You can
                             specify the lifetime in a human-friendly format that uses
                             years/days/min/s, e.g. "7days 30min 10s". The checkpoint's expiry
                             time will be set to the current wallclock time plus the specified
                             lifetime. If the lifetime is not specified, then the checkpoint is
                             set with no expiry and must be explicitly removed
  -s, --source <UUID>        Optionally specify an existing checkpoint to use as the base for
                             the newly created checkpoint. If not provided then the checkpoint
                             will be taken against the latest manifest.
  -n, --name <NAME>          Optionally specify a name for the checkpoint. The name can be used
                             to list the checkpoints. Note that name may not be unique and the
                             list operation can return multiple checkpoints with the same name
```
```
$ slatedb refresh-checkpoint --help
Refresh an existing checkpoint's expiry time. This command will look for an existing checkpoint
and update its expiry time using the specified lifetime

Usage: slatedb --path <PATH> refresh-checkpoint [OPTIONS] --id <ID>

Options:
  -i, --id <ID>              The UUID of the checkpoint (e.g. 01740ee5-6459-44af-9a45-85deb6e468e3)
  -l, --lifetime <LIFETIME>  Optionally specify a new lifetime for the checkpoint. You can
                             specify the lifetime in a human-friendly format that uses
                             years/days/min/s, e.g. "7days 30min 10s". The checkpoint's expiry
                             time will be set to the current wallclock time plus the specified
                             lifetime. If the lifetime is not specified, then the checkpoint is
                             updated with no expiry and must be explicitly removed
```
```
$ slatedb delete-checkpoint --help
Delete an existing checkpoint

Usage: slatedb --path <PATH> delete-checkpoint --id <ID>

Options:
  -i, --id <ID>  The UUID of the checkpoint (e.g. 01740ee5-6459-44af-9a45-85deb6e468e3)
```
```
$ slatedb list-checkpoints --help
List the current checkpoints of the db

Usage: slatedb --path <PATH> list-checkpoints [OPTIONS]

Options:
  -n, --name <NAME>  Optionally specify the name to filter the checkpoints. Note that name may
                     not be unique and the list operation can return multiple checkpoints with
                     the same name. If not specified, all checkpoints will be returned.
```

Creating a Checkpoint
Checkpoints can be created using the create_checkpoint API, the create-checkpoint CLI command, or by the writer/reader clients as part of their normal operations (see the sections below).
Checkpoints can be taken from the current manifest, or from an existing checkpoint. Checkpoints cannot be created from arbitrary manifest versions.
Generally, to take a checkpoint from the current manifest, the client runs a procedure that looks like the following:

1. Get the current manifest M at version V.
2. If M.`initialized` is false, or M.`destroyed_at_s` is set, exit with error.
3. Compute a new v4 UUID id for the checkpoint.
4. Create a new entry in `checkpoints` with:
   - `id` set to the id computed in the previous step
   - `manifest_id` set to V or V+1. We use V+1 to allow addition of the checkpoint to be included with other updates.
   - other fields set as appropriate (e.g. expiry time based on relevant checkpoint creation params)
5. Write a new manifest M’ at version V+1. If CAS fails, go to step 1.
To take a checkpoint from an existing checkpoint with id S, the client runs the following procedure:

1. Get the current manifest M at version V.
2. If `destroyed_at_s` is set, exit with error.
3. Look for the checkpoint C with id S under `checkpoints`. If there is no such checkpoint, exit with error.
4. Check that C is not expired. If it is, then exit with error.
5. Compute a new v4 UUID id for the checkpoint.
6. Create a new entry in `checkpoints` with:
   - `id` set to the id computed in the previous step
   - `manifest_id` set to C.`manifest_id`
   - other fields set as appropriate (e.g. expiry time based on relevant checkpoint creation params)
7. Write a new manifest M’ at version V+1. If CAS fails, go to step 1.
The create-checkpoint CLI and create_checkpoint API run exactly these procedures: the first when called without a source checkpoint ID, and the second when called with one.
Otherwise, the exact procedures used by the writer/reader clients vary slightly depending on the context. For example, the writer may include other updates along with the update that adds the checkpoint. These variants are detailed in the sections below.
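The two procedures above boil down to a compare-and-swap retry loop against the manifest store. The sketch below illustrates that loop against a toy in-memory store; all type and function names here are hypothetical stand-ins, not SlateDB's actual internals, and the checkpoint carries only the fields the procedure needs.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the manifest types described above.
#[derive(Clone)]
struct Checkpoint {
    id: u128,           // stands in for the v4 UUID
    manifest_id: u64,
    expire_time_s: u64, // 0 = never expires
}

#[derive(Clone)]
struct Manifest {
    initialized: bool,
    destroyed_at_s: Option<u64>,
    checkpoints: Vec<Checkpoint>,
}

// An in-memory stand-in for the manifest store with compare-and-swap semantics.
struct ManifestStore {
    versions: HashMap<u64, Manifest>,
    latest: u64,
}

impl ManifestStore {
    // Write `m` at version `expected + 1`; fails if another writer got there first.
    fn cas_write(&mut self, expected: u64, m: Manifest) -> Result<u64, ()> {
        if self.latest != expected {
            return Err(()); // CAS conflict: caller must reload and retry
        }
        self.latest = expected + 1;
        self.versions.insert(self.latest, m);
        Ok(self.latest)
    }
}

// Create a checkpoint against the current manifest, retrying on CAS conflicts.
fn create_checkpoint(
    store: &mut ManifestStore,
    id: u128,
    now_s: u64,
    lifetime_s: Option<u64>,
) -> Result<(u128, u64), String> {
    loop {
        let v = store.latest;
        let mut m = store.versions[&v].clone();
        if !m.initialized || m.destroyed_at_s.is_some() {
            return Err("db is not initialized or is destroyed".to_string());
        }
        // manifest_id = V + 1: the checkpoint pins the manifest version being written.
        m.checkpoints.push(Checkpoint {
            id,
            manifest_id: v + 1,
            expire_time_s: lifetime_s.map(|l| now_s + l).unwrap_or(0),
        });
        match store.cas_write(v, m) {
            Ok(new_v) => return Ok((id, new_v)),
            Err(()) => continue, // go back to step 1
        }
    }
}
```

Checkpointing from an existing checkpoint follows the same loop, but copies `manifest_id` from the source checkpoint instead of using V + 1.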
Writers
Writers will create checkpoints to “pin” the set of SRs currently being used to serve reads. As the writer consumes manifest updates, it will re-establish its checkpoint to point to the latest manifest version. Eventually, when we support range scans, the writer will retain older checkpoints to ensure that long-lived scans can function without interference from the GC.
Establishing Initial Checkpoint
The writer first establishes its checkpoint against the current manifest when starting up, as part of incrementing the epoch:
1. Read the latest manifest version V.
2. Compute the writer’s epoch E.
3. Look through the checkpoints under `checkpoints` for any entries with `metadata` of type `WriterCheckpoint`, and remove any such entries.
4. Create a new checkpoint with:
   - `id` set to a new v4 UUID
   - `manifest_id` set to the next manifest version (V + 1)
   - `checkpoint_expire_time_s` set to None
   - `metadata` set to `WriterCheckpoint{ epoch: E }`
5. Set `writer_epoch` to E.
6. Write the manifest. If this fails with a CAS error, go back to step 1.
Refreshing the Checkpoint
The writer will periodically re-establish its checkpoint to pin the latest SSTs and SRs. It will run the following whenever it detects that the manifest has been updated. This happens either when the manifest is polled (at `manifest_poll_interval`), or when the writer detects an update when it goes to update the manifest with its local view of the core db state:
1. Read the latest manifest version V. (If the writer is applying an update from its local state, then in the first iteration it can assume that its local version is the latest. If it’s wrong, the CAS will fail and it should reload the manifest from Object Storage.)
2. Decide whether it needs to re-establish its checkpoint. The writer re-establishes the checkpoint whenever the set of L0 SSTs or Sorted Runs has changed. This includes both changes made locally by the Writer and changes made by the Compactor.
3. If so, create a new checkpoint with:
   - `id` set to a new v4 UUID
   - `manifest_id` set to V+1
   - `checkpoint_expire_time_s` set to None
   - `metadata` set to `WriterCheckpoint{ epoch: E }`
4. Delete the old checkpoint from `checkpoints`. When we support range scans, we’ll instead want to maintain references to the checkpoint from the range scans and delete the checkpoint when all references are dropped. (We should technically do this for point lookups too, but the likelihood of GC deleting an SST in this case is very low.)
5. Apply any local changes.
6. Write the new manifest. If this fails with a CAS error, go back to step 1.
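The re-establish decision in step 2 reduces to comparing the id sets of L0 SSTs and sorted runs between the pinned manifest and the latest one. A minimal sketch, with illustrative type and field names that are not SlateDB's actual internals:

```rust
use std::collections::BTreeSet;

// Minimal stand-in for the parts of the manifest the decision inspects.
struct CoreState {
    l0_ids: Vec<u64>,         // ids of L0 SSTs
    sorted_run_ids: Vec<u32>, // ids of sorted runs
}

// The writer re-establishes its checkpoint whenever the set of L0 SSTs
// or sorted runs differs between its pinned manifest and the latest one.
// Order is ignored: only set membership matters for pinning.
fn needs_reestablish(pinned: &CoreState, latest: &CoreState) -> bool {
    let l0_a: BTreeSet<_> = pinned.l0_ids.iter().collect();
    let l0_b: BTreeSet<_> = latest.l0_ids.iter().collect();
    let sr_a: BTreeSet<_> = pinned.sorted_run_ids.iter().collect();
    let sr_b: BTreeSet<_> = latest.sorted_run_ids.iter().collect();
    l0_a != l0_b || sr_a != sr_b
}
```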
Clones
The writer will also support initializing a new database from an existing checkpoint. This allows users to “fork” an instance of SlateDB: the fork can access all the original db’s data from the checkpoint, while its writes are isolated to the new db instance.
To support this, we add the Db::open_from_checkpoint method, which accepts a parent_path and Option<parent_checkpoint>. When initializing the cloned database, the writer will do the following:
1. Read the current cloned manifest `Option<M_c>`.
2. Read the current parent manifest `M_p` at `parent_path`.
3. If `M_p.initialized` is false, exit with error.
4. If `Option<M_c>` is present:
   - Validate that `parent_path` and `Option<parent_checkpoint>` are contained in `M_c.external_dbs`. If not, exit with error. If `Option<parent_checkpoint_id>` is not set, accept any checkpoint.
   - If `M_c.initialized` is:
     - True: Validate that all external databases have a final checkpoint. If not, exit with error. Go to step 11.
     - False: Go to step 10.
5. Compute the clone’s `source_checkpoint_id`:
   - If `Option<parent_checkpoint>` is present, use it.
   - Otherwise, create a new ephemeral checkpoint of the parent’s latest manifest and use that. The ephemeral checkpoint will have a TTL of 5 minutes, giving us an upper bound for retrying the clone operation.
6. Create a new clone manifest `M_c'`:
   - Assign `final_checkpoint_id` to a new random v4 UUID.
   - Resolve the parent manifest `M_p'` at `source_checkpoint_id` and use it as the basis for the clone.
   - Set `M_c'.initialized` to False.
7. Write the new clone manifest `M_c'`. If CAS fails, go to step 1. `M_c'` fields are set to:
   - `external_dbs`:
     - initial list copied from `M_p'` (this list is non-empty for nested clones)
       - for each `entry` in the list: set `entry.final_checkpoint_id` to `final_checkpoint_id`
     - add an `entry` for `parent_path` with:
       - `entry.path` set to `parent_path`
       - `entry.source_checkpoint_id` set to `source_checkpoint_id`
       - `entry.final_checkpoint_id` set to `final_checkpoint_id`
   - `writer_epoch`: set to the epoch from `M_p'` + 1
   - `compactor_epoch`: copied from `M_p'`
   - `wal_id_last_compacted`: copied from `M_p'`
   - `wal_id_last_seen`: copied from `M_p'`
   - `l0`: copied from `M_p'`
   - `l0_last_compacted`: copied from `M_p'`
   - `compacted`: copied from `M_p'`
   - `checkpoints`: empty
8. Assign `M_c = M_c'`.
9. For each external database `ED` in `M_c`:
   - Ensure a final checkpoint with id `ED.final_checkpoint_id` exists. If not, create it with the following options: `CheckpointOptions{ lifetime: None, source: Some(ED.source_checkpoint_id) }`.
   - If checkpoint creation fails because the source checkpoint does not exist, exit with error. This can happen when the ephemeral checkpoint expires.
10. Copy over all WAL SSTs between `M_c.wal_id_last_compacted` and `M_c.wal_id_last_seen` from `parent_path`.
11. Update `M_c` with `initialized` set to true. If CAS fails, go to step 1.
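The field mapping in step 7 can be sketched as a pure function from the resolved parent manifest to the initial (uninitialized) clone manifest. The types and field names below are simplified stand-ins for the schema, not SlateDB's actual code:

```rust
#[derive(Clone, PartialEq, Debug)]
struct ExternalDb {
    path: String,
    source_checkpoint_id: u128,
    final_checkpoint_id: u128,
}

#[derive(Clone)]
struct Manifest {
    external_dbs: Vec<ExternalDb>,
    initialized: bool,
    writer_epoch: u64,
    compactor_epoch: u64,
    wal_id_last_compacted: u64,
    wal_id_last_seen: u64,
    l0: Vec<u64>,
    compacted: Vec<u64>,
    checkpoints: Vec<u128>,
}

// Build the initial clone manifest M_c' from the parent manifest M_p'
// resolved at source_checkpoint_id, as in step 7 above.
fn initial_clone_manifest(
    parent: &Manifest,
    parent_path: &str,
    source_checkpoint_id: u128,
    final_checkpoint_id: u128,
) -> Manifest {
    // Nested clones: carry over the parent's external dbs, re-pointing their
    // final checkpoints at the id owned by this clone.
    let mut external_dbs: Vec<ExternalDb> = parent
        .external_dbs
        .iter()
        .map(|e| ExternalDb { final_checkpoint_id, ..e.clone() })
        .collect();
    external_dbs.push(ExternalDb {
        path: parent_path.to_string(),
        source_checkpoint_id,
        final_checkpoint_id,
    });
    Manifest {
        external_dbs,
        initialized: false,                    // flipped to true in step 11
        writer_epoch: parent.writer_epoch + 1, // fences the parent's writer epoch
        compactor_epoch: parent.compactor_epoch,
        wal_id_last_compacted: parent.wal_id_last_compacted,
        wal_id_last_seen: parent.wal_id_last_seen,
        l0: parent.l0.clone(),
        compacted: parent.compacted.clone(),
        checkpoints: Vec::new(),               // the clone starts with no checkpoints
    }
}
```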
SST Path Resolution
The DB may now need to look through multiple paths to find SST files, since it may need to open SST files written by an external db. To support this, we’ll materialize a map of external SSTs at startup by reading `external_dbs` from the manifest, and pass it to the `PathResolver` for use.
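A minimal sketch of that lookup: any SST id found in the external map resolves under the owning db's root, and everything else resolves under the local root. The resolver type, its API, and the `compacted/` path layout shown here are illustrative assumptions, not SlateDB's actual `PathResolver`:

```rust
use std::collections::HashMap;

// A hypothetical resolver that maps an SST id to the path of the db that owns it.
struct PathResolver {
    local_root: String,
    // sst id -> root path of the external db that owns the SST
    external_ssts: HashMap<u64, String>,
}

impl PathResolver {
    // Build the external map from the manifest's external_dbs entries at startup.
    // Each entry is (external db root path, SST ids referenced from that db).
    fn new(local_root: &str, external_dbs: &[(String, Vec<u64>)]) -> Self {
        let mut external_ssts = HashMap::new();
        for (path, sst_ids) in external_dbs {
            for id in sst_ids {
                external_ssts.insert(*id, path.clone());
            }
        }
        PathResolver { local_root: local_root.to_string(), external_ssts }
    }

    // Resolve an SST id to a full object path (layout shown is illustrative).
    fn sst_path(&self, sst_id: u64) -> String {
        let root = self.external_ssts.get(&sst_id).unwrap_or(&self.local_root);
        format!("{}/compacted/{:020}.sst", root, sst_id)
    }
}
```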
Deleting a Database
With the addition of clones, we need to be a bit more careful about how we delete a database. Previously we could simply remove its objects from the Object Store and call it a day. With clones, it’s possible that a SlateDB instance is holding a checkpoint that another database relies upon, so it’s not safe to just delete that data. We’ll instead require that users delete a database by calling the `Db::destroy` method. `Db::destroy` can specify that the delete is either a soft or hard delete using the `soft` parameter. This way we can support both more conservative deletes that clean up data from the GC process after some grace period, and a simple path for hard-deleting a db.
A soft delete will simply mark the manifest as deleted, and rely on a separate GC process to delete the db. It requires that the GC be running in some long-lived process other than the main writer. The GC process will wait both for some grace period and for all checkpoints to either expire or be deleted. At that point it will delete the db’s SSTs and exit. This is covered in more detail in the GC section. A soft delete will be processed by the writer as follows:
1. Claim write-ownership of the DB by bumping the epoch and fencing the WAL. The intent here is to fence any active writer. We assume that the writer will detect that it’s fenced and exit by the time the GC goes to delete the data.
2. Delete the writer’s checkpoint, if any.
3. Update the manifest by setting the `destroyed_at_s` field to the current time.
A hard delete will proceed as follows. It is up to the user to ensure that there is no active writer at the time of deletion:

1. Check that there are no active checkpoints. If there are, return `InvalidDeletion`.
2. If `destroyed_at_s` is not set, then update the manifest by setting the `destroyed_at_s` field to the current time. If CAS fails, go to step 1.
3. Delete all objects under `path` other than the current manifest.
4. Delete the current manifest.
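The ordering in steps 3 and 4 matters: deleting the manifest last means a crash part-way through leaves a manifest behind rather than an orphaned, manifest-less pile of objects. A sketch against a toy object store (the `manifest/current` key and the store shape are illustrative assumptions):

```rust
use std::collections::BTreeMap;

// Hard delete against a toy object store (key -> bytes). Illustrative only.
fn hard_delete(
    store: &mut BTreeMap<String, Vec<u8>>,
    db_path: &str,
    active_checkpoints: usize,
) -> Result<(), &'static str> {
    // Step 1: fail if any non-expired checkpoints remain.
    if active_checkpoints > 0 {
        return Err("InvalidDeletion");
    }
    let manifest_key = format!("{}/manifest/current", db_path);
    // Step 3: delete all objects under the path other than the current manifest.
    let keys: Vec<String> = store
        .keys()
        .filter(|k| k.starts_with(&format!("{}/", db_path)) && **k != manifest_key)
        .cloned()
        .collect();
    for k in keys {
        store.remove(&k);
    }
    // Step 4: delete the manifest itself last.
    store.remove(&manifest_key);
    Ok(())
}
```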
Readers
Readers are created by calling `DbReader::open`. `open` takes an optional checkpoint. If a checkpoint is provided, `DbReader` uses the checkpoint and does not try to refresh/re-establish it. If no checkpoint is provided, `DbReader` establishes its own checkpoint against the latest manifest and periodically polls the manifest at the interval specified in `manifest_poll_interval` to see if it needs to re-establish it. If the manifest has not changed in a way that requires the reader to re-establish the checkpoint, then the reader will refresh the checkpoint’s expiry time once half the lifetime has passed. The checkpoints created by `DbReader` use the lifetime specified in the `checkpoint_lifetime` option.
Establishing the Checkpoint
To establish the checkpoint, DbReader will:

1. Read the latest manifest M at version V.
2. Check that M.`initialized` is true and M.`destroyed_at_s` is not set. Otherwise, return an error.
3. Create a new checkpoint with `id` set to a new v4 UUID, `manifest_id` set to the next manifest version (V + 1), `checkpoint_expire_time_s` set to the current time plus `checkpoint_lifetime`, and `metadata` set to None.
4. Update the manifest’s `wal_id_last_seen` to the latest WAL id found in the wal directory.
5. Write the manifest. If this fails with a CAS error, go back to step 1.
Then, the reader will restore data from the WAL. The reader will just replay the WAL data into an in-memory table.
The size of the table will be bounded by max_memtable_bytes. When the table fills up, it’ll be added to a list and a
new table will be initialized. The total memory used should be limited by backpressure applied by the writer. This
could still be a lot of memory (e.g. up to 256MB in the default case). In the future we could have the reader read
directly from the WAL SSTs, but I’m leaving that out of this proposal.
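The bounded buffering described above can be sketched as follows: entries replay into an active table, and when the next entry would push the table past `max_memtable_bytes`, the table is frozen onto a list and a fresh one is started. Names and the byte-accounting scheme are illustrative, not SlateDB's actual reader code:

```rust
// A sketch of the reader's WAL replay buffering.
struct ReplayBuffer {
    max_memtable_bytes: usize,
    frozen: Vec<Vec<(Vec<u8>, Vec<u8>)>>, // older, full tables
    active: Vec<(Vec<u8>, Vec<u8>)>,      // table currently being filled
    active_bytes: usize,
}

impl ReplayBuffer {
    fn new(max_memtable_bytes: usize) -> Self {
        ReplayBuffer {
            max_memtable_bytes,
            frozen: Vec::new(),
            active: Vec::new(),
            active_bytes: 0,
        }
    }

    fn replay(&mut self, key: &[u8], value: &[u8]) {
        let entry_bytes = key.len() + value.len();
        // Freeze the active table once it would exceed the configured bound.
        if self.active_bytes + entry_bytes > self.max_memtable_bytes && !self.active.is_empty() {
            self.frozen.push(std::mem::take(&mut self.active));
            self.active_bytes = 0;
        }
        self.active.push((key.to_vec(), value.to_vec()));
        self.active_bytes += entry_bytes;
    }
}
```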
Checkpoint Maintenance
If the reader is opened without a checkpoint, it will periodically poll the manifest and potentially re-establish the checkpoint that it creates if the db has changed in a way that requires re-establishing the checkpoint (i.e., the set of SSTs has changed). It does this by re-running the initialization process described above, with the following modifications:

- It will delete the checkpoint that it had previously created (when we support range scans, we’ll need to reference-count the checkpoint).
- It will purge any in-memory tables with content from WAL SSTs lower than the new `wal_id_last_compacted`.
Additionally, if the time until `checkpoint_expire_time_s` is less than half of `checkpoint_lifetime`, the Reader will update `checkpoint_expire_time_s` to the current time plus `checkpoint_lifetime`.
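The refresh condition is a simple lease check (times in seconds; the function name is illustrative):

```rust
// Refresh the checkpoint lease once less than half of the configured
// lifetime remains.
fn should_refresh(now_s: u64, expire_time_s: u64, lifetime_s: u64) -> bool {
    // expire_time_s == 0 means the checkpoint never expires.
    if expire_time_s == 0 {
        return false;
    }
    let remaining = expire_time_s.saturating_sub(now_s);
    remaining < lifetime_s / 2
}
```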
Compactions
During compaction, if an external SST is dereferenced, we’ll remove its entry from the corresponding `external_dbs[].sst_ids` set and update the manifest.
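This pruning step amounts to filtering each external db's id set down to the ids the manifest still references. A minimal sketch with an illustrative shape for the external map (not SlateDB's actual representation):

```rust
use std::collections::HashMap;

// After a compaction, drop any external SST ids that are no longer referenced
// by the manifest. `external` maps external db path -> referenced SST ids.
fn prune_external_refs(
    external: &mut HashMap<String, Vec<u64>>,
    still_referenced: &dyn Fn(u64) -> bool,
) {
    for sst_ids in external.values_mut() {
        sst_ids.retain(|id| still_referenced(*id));
    }
}
```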
Garbage Collector
The GC task will be modified to handle soft deletes and manage checkpoints/clones. The GC task’s algorithm will now look like the following:
1. Read the current manifest. If no manifest is found, exit.
2. If `destroyed_at_s` is set, and the current time is after `destroyed_at_s` + `db_delete_grace`, and there are no active checkpoints, then:
   - If the db is a clone, update the parent db’s manifest by deleting the clone’s checkpoint.
   - Delete all objects under the db’s path other than the current manifest.
   - Delete the current manifest.
   - Exit.
3. Garbage-collect expired checkpoints:
   - Find any checkpoints that have expired and remove them.
   - Write the new manifest version with the expired checkpoints deleted. If CAS fails, go back to step 1 to reload checkpoints that may have been refreshed or added.
4. Delete garbage manifests. This includes any manifest that is older than `min_age` and is not referenced by a checkpoint.
5. Read all manifests referenced by a checkpoint. Call this set M.
6. Clean up WAL SSTs:
   - For a given manifest n, let referenced_wal(n) be the set of all WAL SSTs between n.`wal_id_last_compacted` and n.`wal_id_last_seen`.
   - Let W be the union of referenced_wal(n) for all n in M, plus all WAL SSTs larger than `wal_id_last_compacted` of the current manifest.
   - Delete all WAL SSTs not in W.
7. Clean up SSTs:
   - Let S be the set of all SSTs referenced by all manifests in M.
   - Delete all SSTs not in S with age older than `min_age`.
8. Detach the clone if possible. If the list of external databases is non-empty, then for each external database `DB`:
   - If `DB.sst_ids` is empty at the latest version of the manifest and in all existing checkpoints, then:
     - Remove `DB.final_checkpoint_id` from the external db’s manifest.
     - Update the manifest by removing `DB` from the list of external databases.
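The WAL retention rule in step 6 is a set computation over WAL id windows. A sketch, assuming each manifest's window is `(wal_id_last_compacted, wal_id_last_seen]`; whether the boundaries are inclusive is an implementation detail the sketch does not take from the RFC:

```rust
use std::collections::BTreeSet;

// Minimal stand-in for a manifest's WAL window.
struct ManifestRefs {
    wal_id_last_compacted: u64,
    wal_id_last_seen: u64,
}

// Compute the set of WAL ids that must be retained: the union of each
// checkpointed manifest's (last_compacted, last_seen] window, plus everything
// after the current manifest's last compacted WAL id.
fn retained_wal_ids(
    checkpointed: &[ManifestRefs],
    current: &ManifestRefs,
    all_wal_ids: &BTreeSet<u64>,
) -> BTreeSet<u64> {
    all_wal_ids
        .iter()
        .copied()
        .filter(|id| {
            *id > current.wal_id_last_compacted
                || checkpointed
                    .iter()
                    .any(|m| *id > m.wal_id_last_compacted && *id <= m.wal_id_last_seen)
        })
        .collect()
}
```

Everything outside the retained set is eligible for deletion.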
Observe that users can now configure a single GC process that can manage GC for multiple databases that use soft deletes. Whenever a new database is created, the user needs to spawn a new GC task for that database. When the GC task completes deleting a database, then the task exits. For now, it’s left up to the user to spawn GC tasks for databases that they have created.
Manifest Projection and Union
Clones create new databases that reference SSTs from a parent. In some cases, it is useful to create views over a subset of a manifest’s key range, or to combine multiple non-overlapping manifest views into a single manifest. In particular, this can be useful to implement database rescaling. We introduce two operations on manifests to support this: projection and union.
Projection
Projection creates a filtered view of a manifest restricted to a specific key range specified by the client. Given a
source manifest and a target range, projection returns a new manifest containing only the SSTs whose key ranges overlap
with the target range. SSTs that fall entirely outside the target range are excluded, and SSTs that partially overlap
are annotated with a `visible_range` that constrains which keys are accessible.
Write requests that fall fully or partially outside the `visible_range` will fail, as will read requests that fall
completely outside it.
The new SSTs produced by the Writer and Compactor do not have a `visible_range` attached, although the actual range of
keys in those SSTs lies inside the `visible_range` (in the future, we might want to attach a `visible_range` to the
whole manifest).
The projection process works as follows:
- Clone the source manifest.
- For each L0 SST view, compute the intersection of the SST’s effective range with the projection range. L0 SSTs may overlap with each other, so each SST is evaluated independently against the projection range.
- For each sorted run, compute the intersection of each SST’s range with the projection range. Since SSTs within a sorted run are non-overlapping and ordered, the range of each SST is bounded by the start key of the next SST in the run (i.e., the range is `[current_start, next_start)`). The last SST in a run extends to the end of its effective range.
- For each SST view that has a non-empty intersection with the projection range, create a new view with the `visible_range` set to the intersection. SSTs with an empty intersection are excluded from the result.
- Sorted runs whose SSTs were all excluded are removed entirely. Note that removing sorted runs may leave gaps in sorted run IDs (e.g., `[5, 3, 1]` if SR 4 and 2 were removed). This is acceptable because the compactor does not assume contiguous IDs; it matches sorted runs by ID, not by position. Preserving original IDs is important so that union can correctly match the same logical tier across projected manifests.
- Remove excluded SST IDs from `external_dbs` entries so that the external database’s checkpoint can be released and the SSTs garbage collected once no other manifest references them. Entries whose `sst_ids` list becomes empty are removed entirely, allowing their checkpoint on the external database to be released.
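The per-sorted-run computation above can be sketched as follows, under simplifying assumptions: string keys, fully bounded half-open ranges, and illustrative names (`KeyRange`, `project_sorted_run`) rather than SlateDB’s real types.

```rust
// Half-open key range [start, end); a simplified stand-in for the RFC's
// ranges (real ranges hold byte keys and can be unbounded).
#[derive(Debug, Clone, PartialEq)]
struct KeyRange {
    start: String,
    end: String,
}

fn intersect(a: &KeyRange, b: &KeyRange) -> Option<KeyRange> {
    let start = a.start.clone().max(b.start.clone());
    let end = a.end.clone().min(b.end.clone());
    if start < end {
        Some(KeyRange { start, end })
    } else {
        None // empty intersection: the SST view is excluded
    }
}

// Project one sorted run: each SST's range runs from its first key to the
// next SST's first key; the last SST extends to `run_end`. Returns the
// surviving (sst_index, visible_range) pairs.
fn project_sorted_run(
    first_keys: &[&str],
    run_end: &str,
    proj: &KeyRange,
) -> Vec<(usize, KeyRange)> {
    let mut out = Vec::new();
    for (i, &start) in first_keys.iter().enumerate() {
        let end = first_keys.get(i + 1).copied().unwrap_or(run_end);
        let sst_range = KeyRange { start: start.to_string(), end: end.to_string() };
        if let Some(visible) = intersect(&sst_range, proj) {
            out.push((i, visible));
        }
    }
    out
}
```

For a run with first keys `a`, `f`, `m` and end `z`, projecting onto `[g, n)` drops the first SST entirely and narrows the other two.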
The `visible_range` on an `SsTableView` constrains the keys that are visible when reading the SST. The
`effective_range` (computed, not stored) is the intersection of the SST’s physical range (derived from its first key)
and the `visible_range`. Scans and lookups against a projected SST only return keys within the effective range.
Projection is logically equivalent to deleting all entries outside the projected range. Once an SST has been
projected, its `visible_range` can only be narrowed further, never extended: the entries outside the original
projection are treated as if they no longer exist.
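A minimal sketch of this computed intersection, again with simplified string-keyed half-open ranges (`KeyRange` and `effective_range` are illustrative names, not the real SlateDB API):

```rust
// Half-open key range [start, end), simplified to string keys.
#[derive(Debug, Clone, PartialEq)]
struct KeyRange {
    start: String,
    end: String,
}

fn intersect(a: &KeyRange, b: &KeyRange) -> Option<KeyRange> {
    let start = a.start.clone().max(b.start.clone());
    let end = a.end.clone().min(b.end.clone());
    (start < end).then(|| KeyRange { start, end })
}

// effective_range is computed, not stored: the SST's physical range clipped
// by its visible_range. An absent visible_range means the SST is fully
// visible; an empty intersection means nothing in the SST is readable.
fn effective_range(physical: &KeyRange, visible: Option<&KeyRange>) -> Option<KeyRange> {
    match visible {
        Some(v) => intersect(physical, v),
        None => Some(physical.clone()),
    }
}
```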
Note: currently, size estimation for projected SSTs is not accurate, which might lead to suboptimal compactor decisions. This will be improved in future iterations.
Note: currently, it is the client’s responsibility to maintain the ranges associated with SlateDB instances and supply them to projection and union.
Union
Union combines multiple non-overlapping manifests into a single manifest. This is the inverse of projection: given a set of manifests that each cover a distinct portion of the key space, union produces a manifest that covers the combined key space. The source databases don’t have to share a root database (e.g., they can be independent shards).
Union does not support merging WAL state at the moment. Input manifests must have their WAL fully compacted before
performing a union.
This can be achieved, for example, by calling `db.flush_with_options(... FlushOptions { flush_type: FlushType::MemTable })`
to force a flush to L0 (all writes prior to that flush are guaranteed to be in L0+ and in the manifest), or by setting
`wal_enabled: false` from the beginning.
The union process works as follows:
- Sort the input manifests by the start bound of their key ranges.
- Validate that the manifests are non-overlapping. Each manifest’s key range is computed from the effective ranges of all its L0 and compacted SSTs, optionally intersected with the `visible_range` for each manifest if provided by the user. If any two manifests have intersecting key ranges, the operation fails. If the `visible_range`s are explicitly provided, then validate that they are adjacent.
- Merge the contents of all input manifests:
  - `external_dbs` entries from all input manifests are merged and deduplicated by `(path, source_checkpoint_id)`. `external_dbs` with the same `(path, source_checkpoint_id)` originated from the same external_db entry via projection, so their `sst_ids` lists are merged (unioned). Entries with different keys for the same path represent distinct checkpoints and must be kept separate. Without this deduplication, repeated cycles of projection and union cause exponential growth of duplicated entries: projecting into N parts and unioning back multiplies the entry count by N each cycle. New `final_checkpoint_id` values are generated for all entries; the old ones belong to the source databases’ clone relationships and are not valid for the new database.
  - SSTs with the same ID but originating from different source databases are deduplicated by assigning a new ID (this might happen because ULIDs are not guaranteed to be globally unique) and updating the corresponding views accordingly. To preserve the order of ULIDs within the same database, subsequent SSTs might require new IDs as well. The old ID is kept in `CompactedSsTableV2` as `original_id` and in `external_db.sst_ids` to allow path resolution.
  - L0 SST views are copied from the source databases (preserving the order within each source database).
  - Sorted runs are copied from the source manifests, preserving their relative order within each source manifest. Their IDs are regenerated to maintain uniqueness. As an optimization, similarly sized SRs can be merged together to reduce LSM height and metadata size; only SRs from different manifests can be merged, because they are guaranteed to be non-overlapping.
- `last_l0_seq` is set to the maximum of `last_l0_seq` across all input manifests. This ensures that the new database’s writer assigns sequence numbers higher than any existing data. Without this, new writes could receive sequence numbers that collide with those in the carried-over SSTs, breaking snapshot isolation and MVCC ordering. Note that existing SSTs from independent sources may have overlapping sequence numbers, but this is safe because the non-overlapping key range validation (step 2) guarantees that no two entries for the same key can have conflicting sequence numbers.
- The resulting manifest has `initialized` set to `false`. The caller must complete setup (e.g., creating final checkpoints in the source databases) before setting `initialized` to `true`.
- Each source database is added as an `external_dbs` entry in the resulting manifest, with a new `final_checkpoint_id` and its owned SST IDs recorded, so that the new database can resolve SST paths and maintain checkpoints on the source databases to prevent their SSTs from being garbage collected. Owned SST IDs are those in the source manifest that are not already tracked by its own `external_dbs` (i.e., SSTs physically stored under the source’s path).
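The sort-and-validate portion of the union process can be sketched as follows. The types and names here are illustrative; the real implementation derives each manifest’s range from the effective ranges of its L0 and compacted SSTs.

```rust
// Half-open key range [start, end) covered by one input manifest; in the
// real system this is derived from the manifest's L0 and compacted SSTs.
#[derive(Debug, Clone, PartialEq)]
struct ManifestRange {
    start: String,
    end: String,
}

// Sort the input manifests by start bound, then reject any overlap between
// neighbors. Exactly adjacent ranges (prev.end == next.start) are allowed.
fn validate_non_overlapping(
    mut ranges: Vec<ManifestRange>,
) -> Result<Vec<ManifestRange>, String> {
    ranges.sort_by(|a, b| a.start.cmp(&b.start));
    for pair in ranges.windows(2) {
        if pair[1].start < pair[0].end {
            return Err(format!("manifests overlap near key {:?}", pair[1].start));
        }
    }
    Ok(ranges)
}
```

Because the ranges are sorted first, checking each neighboring pair is sufficient to detect any overlap among the full set.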
Merging SST
SST views coming from different databases might refer to the same SSTs (but have different `visible_range`s). It might
be beneficial to merge such views in order to:
- reduce the chance of hitting `l0_max_ssts` on writes right after union
- improve the accuracy of size estimates (used by the Compactor)
- reduce the size of the metadata
However, there are a number of constraints that must be met when merging SST views:
- Due to compaction, views over the same SST might not be adjacent, i.e., they may have gaps; preserving those gaps is necessary for correctness, so only adjacent views can be merged.
- Sorted Run boundaries (and the L0/SR boundary) cannot be crossed.
- Within an SR, SST views must be sorted by key and cannot overlap.
- L0 ordering: newer SST views must precede views over older SSTs.
The last constraint could be met by ordering SST views by their underlying SSTs’ logical creation time. This time is normally represented by a ULID; however, there is a slight chance that, due to wall clock skew, a parent’s SST ULIDs are later than its children’s. Instead, logical creation time can be inferred from the topological order of SSTs (with dependencies implied by the order of SSTs in the L0 lists).
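Subject to the gap-preservation constraint above, merging within a single sorted run can be sketched as follows (`SstView` and `merge_adjacent_views` are illustrative names; the real manifest types are richer):

```rust
// A view over an SST restricted to a half-open visible range [start, end);
// a simplified stand-in for the real manifest's SST view types.
#[derive(Debug, Clone, PartialEq)]
struct SstView {
    sst_id: u64,
    start: String,
    end: String,
}

// Merge adjacent views within one sorted run. Two consecutive views merge
// only if they cover the same SST and prev.end == next.start; any gap
// carries deletion semantics and must be preserved, so it blocks the merge.
fn merge_adjacent_views(views: &[SstView]) -> Vec<SstView> {
    let mut out: Vec<SstView> = Vec::new();
    for v in views {
        match out.last_mut() {
            Some(prev) if prev.sst_id == v.sst_id && prev.end == v.start => {
                prev.end = v.end.clone(); // extend the previous view
            }
            _ => out.push(v.clone()),
        }
    }
    out
}
```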
Apart from the union process above, cloning from multiple source databases works the same as a regular clone (see Clones), including the creation of final checkpoints in the source databases (and their source databases, recursively).
There are a few minor differences:
- Writer and compactor epochs are 0. The new database is a fresh instance with no prior writer or compactor history, so both epochs start at 0.
- Default manifest core fields. Fields such as `next_wal_sst_id` and similar manifest core metadata are initialized to their default values rather than being copied from any source database. An exception is `last_l0_seq`, which is derived from the maximum across all sources during union (see the union process above).
- WAL is not carried over from any source database. All source manifests must have their WAL fully compacted before the operation. The new database starts with an empty WAL, which also allows it to start with a new `wal_id_last_seen` and `replay_after_wal_id` on union.
- The SequenceTracker is re-initialized on union; sequence tracking only works for records added afterwards.
- Due to a temporary spike in the number of L0 SSTs after union, `l0_max_ssts` might be exceeded, blocking writes until compaction catches up. This is partially mitigated by the `max_unflushed_bytes` buffer, which can absorb the spike. Alternatively, this can be improved by merging adjacent SST views, and/or by checking `l0_max_ssts` against the maximum number of views per key.
Note: union works correctly starting from Manifest V2, with the introduction of `SstViewRange`. Without it, SSTs in L0
and sorted runs might be duplicated, which can lead to incorrect results and violates assumptions in the Compactor and
other places.