# Timestamps & Time-To-Live
Status: Accepted
Authors:
References:
- https://github.com/slatedb/slatedb/issues/218
- https://github.com/slatedb/slatedb/issues/225
- https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp
- https://github.com/facebook/rocksdb/wiki/Time-to-Live
- https://github.com/facebook/rocksdb/wiki/Compaction-Filter
- https://dl.acm.org/doi/pdf/10.1145/3483840
- https://disc-projects.bu.edu/lethe/
- https://dl.acm.org/doi/10.1145/3555041.3589719
## Motivation

SlateDB does not have a native understanding of time in data operations (it does with regard to compaction/garbage collection; more on this later). This adds complexity for operators who want to implement semantics that depend on the insertion/modification time of rows. The most common uses of timestamps are:
- Time-To-Live: many applications, for compliance or space-saving purposes, require purging data a certain amount of time after its insertion. Use cases with time series data and privacy requirements such as GDPR come to mind here, though ensuring that deleted data is collected within a time bound is left as a future exercise.
- Resolving Out-of-Order Operations (out of scope): some applications may process incoming requests in a manner that results in out-of-order insertions to SlateDB (stream processing applications are frequently subject to this type of pattern). Using insertion timestamps to resolve which version of a key is more recent may be preferable to using the clock at the time of insertion.
- Historical Reads (out of scope): while SlateDB provides snapshot capabilities to ensure reads are consistent and repeatable, it requires a snapshot to have been taken at the desired time. Increasing the number of snapshots adds significant storage overhead. Instead, it is possible to trade read latency for space amplification by maintaining multiple versions of keys at different timestamps to allow for similar "historical" read operations.
## Goals

- Define the API for specifying timestamps via user-defined clocks
- Define the changes to the on-disk format of stored rows
- Outline a mechanism to support Time-To-Live (TTL) via Compaction Filters
## Out of Scope

While the following are out of scope for the initial implementation, the design should not preclude:
- Handling out-of-order insertions, though we make sure the design accounts for the future possibility (see the rejected alternatives section)
- Outlining compaction strategies that are optimized for TTL (such as Cassandra’s TWCS). This will require additional metadata per-SST and a compaction strategy to match, but is otherwise possible with what’s outlined in the scope of this design.
- Outlining compaction strategies that target efficient deletes (such as outlined in Lethe)
- Ensuring deletions are resolved within a time bound (a TTL does not ensure the data is deleted from the object store within the specified time)
- A public API for compaction filters (see https://github.com/slatedb/slatedb/issues/225) - this RFC will implement TTL using a compaction filter but will not yet expose a public API for plugging in custom compaction filters. Instead, the implementation should ensure that such an extension is easy.
## Public API

We will make the following changes to `DbOptions`:
- introduce a configuration for a custom clock, which defaults to the system clock
- introduce a configuration for a default TTL, which defaults to `None`
Note that the clock is expected to return monotonically increasing (greater than or equal to previous) values. This will be enforced by wrapping the supplied `Clock` implementation in one that tracks the previous value returned.
```rust
/// Defines the clock that SlateDB will use during this session
pub trait Clock {
    /// Returns a timestamp in milliseconds since the unix epoch;
    /// must return monotonically increasing numbers (this is enforced
    /// at runtime and will panic if the invariant is broken).
    ///
    /// Note that this clock does not need to return a number that
    /// represents the unix timestamp; the only requirement is that
    /// it represents a sequence that can attribute a logical ordering
    /// to actions on the database.
    fn now(&self) -> i64;
}

/// Contains the default implementation of the Clock, and will return the system time
pub struct SystemClock;

// ...

pub struct DbOptions {
    // ...

    /// The Clock to use for insertion timestamps
    ///
    /// Default: the default clock uses the local system time on the machine
    clock: Option<Arc<dyn Clock>>,

    /// The default time-to-live (TTL) for insertions (note that re-inserting a key
    /// with any value will update the TTL to use the default_ttl)
    ///
    /// Default: no TTL (insertions will remain until deleted)
    default_ttl: Option<Duration>,
}
```
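As an illustration of how the enforcement described above could work, here is a minimal sketch of a monotonicity wrapper; the `MonotonicClock` name and the `AtomicI64` bookkeeping are assumptions for this sketch, not part of the proposed API:

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical wrapper that enforces the monotonicity invariant by
/// remembering the largest value the inner clock has returned so far.
struct MonotonicClock {
    inner: Arc<dyn Clock>,
    last: AtomicI64,
}

impl Clock for MonotonicClock {
    fn now(&self) -> i64 {
        let now = self.inner.now();
        // fetch_max returns the previous maximum; if the supplied clock
        // moved backwards we panic, as described above.
        let prev = self.last.fetch_max(now, Ordering::SeqCst);
        assert!(now >= prev, "clock went backwards: {now} < {prev}");
        now
    }
}
```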
To enforce that the clock remains monotonically increasing across instantiations of the database, we will record the maximum timestamp written in both the manifest and each SST (see the change to `SsTableInfo` below). When seeding the database's minimum timestamp, SlateDB will use the maximum value seen across all persisted WAL SSTs and the latest manifest.
```
table ManifestV1 {
    // ...
    timestamp: long;
}
```
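A minimal sketch of the seeding logic described above, with illustrative parameter names (the actual recovery code path is not specified in this RFC):

```rust
/// Seed the database's minimum clock value at startup by taking the maximum
/// timestamp observed across the latest manifest and all persisted WAL SSTs.
fn seed_min_timestamp(manifest_ts: i64, wal_sst_max_ts: impl IntoIterator<Item = i64>) -> i64 {
    wal_sst_max_ts.into_iter().fold(manifest_ts, i64::max)
}
```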
`Db#put_with_options` can optionally specify a row-level TTL at insertion time by leveraging a newly added field in `WriteOptions`:
```rust
pub struct WriteOptions {
    /// The time-to-live (ttl) for this insertion. If this insert overwrites an existing
    /// database entry, the TTL for the most recent entry will be canonical.
    ///
    /// Default: the TTL configured in DbOptions when opening a SlateDB session
    pub ttl: Option<Duration>,
}
```
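A hypothetical usage sketch follows; the exact shape of `put_with_options` (argument order, async-ness) and the open `Db` handle named `db` are assumed here for illustration:

```rust
use std::time::Duration;

// Expire this key roughly one hour after insertion, overriding any
// default_ttl configured in DbOptions.
let options = WriteOptions {
    ttl: Some(Duration::from_secs(3600)),
};
db.put_with_options(b"user:1234", b"session-token", &options).await?;
```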
For this RFC we will not introduce the ability to disable filtering out TTL-expired records on retrieval. The default implementation will read the metadata for every retrieved row and filter out records where `ts + ttl < now`. As a future optimization, we can enable lazy deserialization of the metadata so that this doesn't need to happen on each retrieved row if the user requests the ability to retrieve TTL'd rows that have not yet been compacted. This can be part of the API design for compaction filters (see https://github.com/slatedb/slatedb/issues/225).
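A minimal sketch of the read-path check described above, using the `RowAttributes` struct defined later in this RFC (the helper name is illustrative; the stored `expire_ts` is equivalent to `ts + ttl`):

```rust
/// Returns true if a retrieved row should be filtered out because its TTL
/// has elapsed; rows without an expire_ts never expire.
fn is_expired(attributes: &RowAttributes, now: i64) -> bool {
    matches!(attributes.expire_ts, Some(expire_ts) if expire_ts < now)
}
```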
## Implementation

### Data Format

Currently, SlateDB does not have a mechanism for backwards compatibility of the SST data format.
We will first introduce the following fields to the flatbuffer definitions:
- `format_version`: specifies the codec version that encoded this SST
- `row_features`: an enumeration of metadata fields that are available with each encoded row. This makes it possible in the future to save space per row by disabling row attributes, but that is out of scope for this RFC
- `creation_timestamp`: the time at which this SST file was written. For this RFC this timestamp is used to schedule follow-up compactions (see periodic compaction)
- `min_ts`/`max_ts`: metadata that indicates the earliest and latest timestamp values in the SST
```
/// the metadata encoded with each row; note that the ordering of this enum is
/// sensitive and must be maintained
enum RowFeature: byte {
    RowFlags,
    Timestamp,
    TimeToLive,
}
```
```
table SsTableInfo {
    // ...

    // The current encoding version of the data block
    format_version: uint;

    // The metadata attributes that are encoded with each row. These attributes
    // will be encoded in the order that they are declared in the RowFeature enum
    row_features: [RowFeature];

    // The time at which this SST was created, based on the configured Clock in
    // DbOptions
    creation_timestamp: long;

    // The minimum timestamp of any row in the SST, based on the configured Clock
    // in DbOptions
    min_ts: long;

    // The maximum timestamp of any row in the SST, based on the configured Clock
    // in DbOptions
    max_ts: long;
}
```
The `format_version` will be written when the SST is created, and we will read data back using a codec that maps 1:1 with the format version. Version 0 has the following per-value format:
```
| u16     | var | u32       | var   |
|---------|-----|-----------|-------|
| key_len | key | value_len | value |
|---------|-----|-----------|-------|
```
The `format_version` will be bumped to 1, and the schema will be modified as follows:
```
| u16     | var | var  | u32       | var   |
|---------|-----|------|-----------|-------|
| key_len | key | attr | value_len | value |
|---------|-----|------|-----------|-------|
```
The newly introduced `attr` field will be decoded using the `row_features` array specified for this SST. Which row attributes are included depends on the data that is being written. If any row in the SST has a `ttl`, the `TimeToLive` feature will be enabled.
As part of this proposal, we will no longer represent tombstones as an `INT_MAX` value_len; instead, an indicator byte in the attribute array specifies the `RowFlags`. `RowFlags` is a bitmask of descriptors on the row. The first bit in `RowFlags` is set to `1` if the row is a tombstone; other bits are reserved for future flags. Later we can add other types of row attributes as necessary (since flatbuffers does not support single-bit values, a byte is the smallest amount of data we can store here):
```
                |-- attr --|
| u16     | var | byte     | u32       | var   |
|---------|-----|----------|-----------|-------|
| key_len | key | row_flag | value_len | value |
|---------|-----|----------|-----------|-------|
```
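For example, a sketch of how the tombstone bit could be tested (the constant and helper names are illustrative):

```rust
/// The first (least significant) bit of RowFlags marks a tombstone;
/// the remaining bits are reserved for future flags.
const FLAG_TOMBSTONE: u8 = 0b0000_0001;

fn is_tombstone(row_flags: u8) -> bool {
    row_flags & FLAG_TOMBSTONE != 0
}
```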
In cases where both `Timestamp` and `TimeToLive` are enabled, the row will look like:
```
                |-------- attr --------------|
| u16     | var | byte     | i64 | i64       | u32       | var   |
|---------|-----|----------|-----|-----------|-----------|-------|
| key_len | key | row_flag | ts  | expire_ts | value_len | value |
|---------|-----|----------|-----|-----------|-----------|-------|
```
Both `ts` and `expire_ts` are encoded as `i64` in milliseconds.
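To make the layout concrete, here is a hedged sketch of serializing a non-tombstone row in the version 1 format when all three row features are enabled; the function and the use of the `bytes` crate's `BufMut` are illustrative assumptions, not SlateDB's actual codec:

```rust
use bytes::BufMut;

/// Illustrative v1 encoder for a row in an SST whose row_features are
/// [RowFlags, Timestamp, TimeToLive]; a sketch, not the real implementation.
fn encode_row_v1(buf: &mut Vec<u8>, key: &[u8], value: &[u8], ts: i64, expire_ts: i64) {
    buf.put_u16(key.len() as u16);   // key_len
    buf.put_slice(key);              // key
    buf.put_u8(0);                   // row_flag: tombstone bit (bit 0) unset
    buf.put_i64(ts);                 // ts, milliseconds
    buf.put_i64(expire_ts);          // expire_ts, milliseconds
    buf.put_u32(value.len() as u32); // value_len
    buf.put_slice(value);            // value
}
```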
The metadata will be returned in the decoded `KeyValueDeletable`, which will be modified:
```rust
pub struct KeyValueDeletable {
    pub key: Bytes,
    pub value: ValueDeletable,
    pub attributes: RowAttributes,
}

pub struct RowAttributes {
    pub ts: Option<i64>,
    pub expire_ts: Option<i64>,
}
```
## Compaction

### TTL Compaction Filter

We will introduce the concept of a compaction filter, which serves as a mechanism for pluggable participation in the compaction process. Specifically, the filter (if configured) will run on each key during the compaction process and optionally remove or modify the value. This is independent of the periodic compaction interval specified in the following section.
For the scope of this RFC, we will introduce a builtin TS/TTL compaction filter. This filter will add a tombstone for any value where `db.clock.now() > kv.attributes.expire_ts`. Note that tombstones will still contain a timestamp (see the discussion of out-of-order insertions in the rejected alternatives section). If a tombstone is added, the TTL (`expire_ts`) for the row will be cleared, indicating that a future run of this filter should not remove the tombstone.
Since this RFC does not cover the public API for compaction filters, this filter will simply run on any SST whose `row_features` include both `Timestamp` and `TimeToLive`.
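A sketch of the filter's behavior follows; the `CompactionFilter` trait and `FilterDecision` enum are hypothetical names, since this RFC deliberately does not define the public filter API:

```rust
use std::sync::Arc;

// Hypothetical plumbing; the public API is left to a follow-up RFC.
enum FilterDecision {
    Keep,
    // Replace the value with a tombstone that keeps `ts` but clears `expire_ts`.
    Tombstone,
}

trait CompactionFilter {
    fn filter(&self, kv: &KeyValueDeletable) -> FilterDecision;
}

struct TtlCompactionFilter {
    clock: Arc<dyn Clock>,
}

impl CompactionFilter for TtlCompactionFilter {
    fn filter(&self, kv: &KeyValueDeletable) -> FilterDecision {
        match kv.attributes.expire_ts {
            // Expired: the emitted tombstone retains its timestamp but has
            // expire_ts cleared so a later run does not remove it.
            Some(expire_ts) if self.clock.now() > expire_ts => FilterDecision::Tombstone,
            _ => FilterDecision::Keep,
        }
    }
}
```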
### Periodic Compaction

If the SST is written with the `TimeToLive` row attribute enabled, the compaction scheduler will compare the SST's `SsTableInfo#creation_timestamp` with a newly introduced compaction config for periodic compaction time limits:
```rust
use std::time::Duration;

pub struct CompactorOptions {
    // ...

    // When there are SSTs in L0 (or Sorted Runs with SSTs in other levels) older than
    // `periodic_compaction_interval`, SlateDB will include them in the next scheduled
    // compaction -- this is a best-effort interval and SlateDB does not guarantee that
    // compaction will happen within this interval (implementations of the compaction
    // scheduler are free to implement jitters or other mechanisms to ensure that SSTs
    // are not all compacted simultaneously)
    periodic_compaction_interval: Duration,
}
```
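A sketch of how a scheduler might apply this interval (the helper is illustrative; jitter and batching are omitted):

```rust
use std::time::Duration;

/// An SST (or Sorted Run) becomes eligible for periodic compaction once its
/// age, measured against the configured Clock, exceeds the interval.
fn due_for_periodic_compaction(creation_timestamp_ms: i64, now_ms: i64, interval: Duration) -> bool {
    now_ms.saturating_sub(creation_timestamp_ms) >= interval.as_millis() as i64
}
```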
This will ensure that Sorted Runs or SSTs that are not otherwise scheduled by the compaction scheduler will be picked up, so that TTL-expired rows and tombstones are compacted away. This is particularly relevant for the lower levels of the LSM tree.
## Rejected Alternatives & Follow-Ups

### Insertion Timestamps

A previous version of this proposal introduced the concept of row-level timestamps: the ability to set a timestamp at each insertion instead of only specifying a custom clock. This introduced the ability to have out-of-order insertions, which in turn caused significant downstream complications.
In order to leave the door open for timestamps on insertion, we encode the clock time with each row in the SST as opposed to only maintaining the time at which it should expire. This comes at the cost of an extra 8 bytes per row when cached in memory (compression will likely mean it is stored as fewer bytes on disk).
Another design consideration in the RFC that leaves the door open for insertion timestamps is the association of a timestamp with tombstones. This is to handle the scenario where a record with a TTL is inserted with a lower timestamp than an existing record. Consider the following:
```
seq0: Insert K1 with TS=100 and TTL=100000
seq1: Insert K1 with TS=50  and TTL=100
```
If `seq0` and `seq1` result in writes in different levels of the LSM tree, then when a compaction kicks in at TS=170, a tombstone is generated for `K1` because the `seq1` entry expired at TS=150 (50 + 100). This tombstone should not result in a deletion of the original `seq0` insert, as it logically happened "after" `seq1`.
## Updates

### September 25, 2024

- Removed out-of-order insertions
- Removed the functionality to retrieve the metadata of a row. Without multiple copies of a row, the insertion time is less interesting; we can consider adding this functionality back upon request.
- Added the requirement for the clock to be monotonically increasing
- Added a timestamp field to `ManifestV1` to seed the clock
- Added a `creation_timestamp` field to `SsTableInfo` and `periodic_compaction_seconds` to the `CompactorOptions`
- Added a `RowType` metadata feature and used it to replace the current encoding of tombstones
### October 1, 2024

- Changed `RowType` to `RowFlags` and indicated that the first bit is used to identify tombstones
- Added `min_ts`/`max_ts` into `SsTableInfo`
- Updated the seeding of the clock min timestamp to look at the WAL SSTs as well as the manifest
- Modified the stored value in the SST from `ttl` to `expire_ts`
- Marked `periodic_compaction_interval` as "best effort" to allow implementations to optimize performance characteristics, such as adding a jitter or a max number of SSTs to compact in one go
- Many misc. improvements to readability and clarifications
### October 6, 2024

- Minor changes and feedback. Most notable is renaming "meta features" to "row attributes"