Range Scans

Status: Accepted

Authors:

Jason Gustafson

References:

Motivation

The ability to scan the keys in a range is an essential feature for building applications. For example, this can be used to implement a collections of keys using a common prefix. Currently, SlateDB only supports fetching individual values for known keys.

Goals

The goal of this work is to provide a flexible capability to query a range of keys in the database. It should be possible to specify unbounded ranges to allow querying all keys, or bounded ranges for finer precision and efficiency.

Range queries have the potential to be expensive from a resource perspective because they hold references to existing tables, which prevents garbage collection. Our goal here is to provide a simple initial approach to state management which optimizes for common cases, while leaving room for further optimization in the future.

Public API

We introduce a new Db::scan API which accept a DbRange defining the range of the scan. The result will be a DbIterator object which iterates the current records within the range. Users may either provide an explicit starting range in the call to scan, or they may seek using the returned iterator.

impl Db {
    /// Scan a range of keys.
    ///
    /// returns a `DbIterator`
    pub async fn scan&lt;T&gt;(
        &amp;self,
        range: T
    ) -&gt; Result&lt;DbIterator, SlateDbError&gt; where
        T: RangeBounds&lt;Bytes&gt;
    {
      ...
    }

}

impl DbIterator {
    /// Get the next record in the scan.
    ///
    /// returns Ok(None) when the scan is complete
    /// returns Err(InvalidatedIterator) if the iterator has been invalidated
    ///  due to an underlying error
    pub async fn next(
        &amp;mut self,
    ) -&gt; Result&lt;Option&lt;KeyValue&gt;, SlateDbError&gt; {
        ...
    }

    /// Seek to a key ahead of the last key returned from the iterator or
    /// the lower range bound if no records have yet been returned.
    ///
    /// returns Ok(()) if the position is successfully advanced
    /// returns SlateDbError::InvalidArgument if `lower_bound` is `Unbounded`
    /// returns SlateDbError::InvalidArgument if the key is comes before the
    ///  current iterator position
    /// returns SlateDbError::InvalidArgument if `lower_bound` is beyond the
    ///  upper bound specified in the original `scan` parameters
    /// returns Err(InvalidatedIterator) if the iterator has been invalidated
    ///  in order to reclaim resources
    pub async fn seek(
        &amp;mut self,
        next_key: Bytes,
    ) -&gt; Result&lt;(), SlateDbError&gt; {
        ...
    }

}

pub struct KeyValue {
    pub key: Bytes,
    pub value: Bytes,
}

Note that the DbIterator may be invalidated internally if any of its referenced resources must be reclaimed. In this case, a call to DbIterator::next or DbIterator::scan will return SlateDb::InvalidatedError. This is discussed in more detail below.

We will also provide Db::scan_with_options, which allows the user to specify the query isolation through a ScanOptions struct. Similar to get_with_options, the two isolation levels that will be supported are Committed and Uncommitted. The default when scan is called without explicit options will be Committed.

impl Db {
    /// Scan a range of keys with the provided options.
    ///
    /// returns a `DbIterator`
    pub async fn scan_with_options&lt;T&gt;(
        &amp;self,
        range: T,
        options: &amp;ScanOptions,
    ) -&gt; Result&lt;DbIterator, SlateDbError&gt; where
        T: RangeBounds&lt;Bytes&gt;
    {
      ...
    }
}

pub struct ScanOptions {
    /// The read commit level for read operations
    pub read_level: ReadLevel,
    /// The number of bytes to read ahead
    pub read_ahead_bytes: usize,
    /// Whether or not fetched blocks should be cached
    pub cache_blocks: bool
}

Implementation

A scan must iterate every table which may have keys within the query range. This includes tables in memory which which have not been flushed and L0/SR tables which may need to be loaded from object storage. Internally, we will rely on iterators which sort and merge these tables. The latest records will have precedence following the same rules that a normal get query follows.

We will rely on the latest checkpoint state in order to obtain references to memtables, SSTs, and sorted runs needed for the query. The DbIterator will expose the UUID of the checkpoint that it was generated from. The DbIterator will only maintain references to tables which cover the range specified in the scan parameters.

Note that if the DbReader was created with a reference to a specific checkpoint, then the DbIterator will only refer to that checkpoint.

The existing MergeIterator used for compaction provides much of the necessary foundation to unify multiple iterators. We will build an iterator which sorts and merges memtables (e.g. with MemTableIterator), L0 tables (SstIterator), and sorted runs (SortedRunIterator).

Query Isolation: When using the Uncommitted isolation, unflushed records recorded in WALs will be used as a source for the key range. Otherwise, only committed records in memtables, as well as L0 and SR tables will be used.

Range queries must access mutable table state, such as the current WAL. It is therefore possible for the query results to include some records which were written after the query began scanning its range of records. For example, a query may include only a subset of records that were inserted in the same WriteBatch. In other words, range queries will not have snapshot isolation initially.

When a Db or DbReader is closed, active iterators will be invalidated. This will cause calls to DbIterator::next or DbIterator::seek to fail with an InvalidatedIterator error which indicates that the database was shutdown.

RFC-???? proposes transactional semantics for SlateDB. This relies on a monotonic sequence number which is tagged on each key update. Range queries could leverage this sequence number to improve isolation by filtering any updates with sequence numbers larger than the latest at the time the query was executed. We leave this as a gap for future work to address.

Checkpoint References

We intend to rely on state checkpoints introduced in RFC-0004 in order to prevent the cleanup of durable state referenced by the query. When a range scan is created, it will create an internal reference to the current checkpoint. This reference will prevent cleanup of the checkpoint state until the scan is complete.

Checkpoints created by DbReaders have an expiry time, which is only refreshed as long as long as the checkpoint remains current. If a DbIterator instance remains active long enough for its referenced checkpoint to expire, then the DbIterator may eventually encounter a NotFound error from the object store. This error will be propagated to the next invocation of DbIterator::next or DbIterator::seek. After encountering this error, the iterator will be invalidated. Invalidated iterators will return SlateDbError::InvalidatedIterator on any subsequent call to DbIterator::next or DbIterator::seek.

In general, we make the initial simplifying assumption that scans will not be frequent enough or last long enough for checkpoint accumulation to be a problem. In the future, we could either limit the number of checkpoint references that are available for scanning or we could limit the time that a DbIterator will remain valid.

Memory Usage

In addition to the checkpoint, a range scan may hold references to records in unflushed WALs or memtables. A potential optimization is to pre-merge these tables in order to filter the records within the scan range. This can be useful when targeting a small range of keys with the scan. This is left for future work.

Iterators will use prefetching to improve performance. Internally, this is already implemented by SstIterator. Users can influence prefetching to improve performance using ScanOptions::read_ahead_size, which indicates the number of bytes to prefetch. Internally, this is used to set SstIterator::blocks_to_fetch. The default read_ahead_size will be the size of a single block.

Users can also influence caching behavior using ScanOptions::cache_blocks. Internally, this will be used to set SstIterator::cache_blocks. The default behavior will be to use cache_blocks=true. Note that while an iterator is actively scanning from a given block, it will remain pinned in memory.

Rejected Alternatives & Follow-Ups

Improved isolation: Transactional semantics proposed in RFC-???? will provide the foundation to support snapshot isolation or serializability of range scans. We anticipate future work to address the existing gaps.

Space optimizations:

Rather than retaining references to in-memory tables for the full duration of the query, we can filter the set of records within the query’s range. As new records are returned, previous ones fall out of scope and may be cleaned up. This makes sense for bounded queries where the number of matching keys within the WAL or memtables may be low. For an unbounded query, we may prefer to hold a reference to the full table.
While a range scan is in progress, some of its in-memory table references may get flushed to storage. It would be possible to substitute these references in order to allow the in-memory state to be cleaned up. This might make sense when the rate of updates is high enough that memory must be aggressively reclaimed.

Resource Limits: As mentioned above, it is possible to impose limits on active range scans in order to ensure that they do not exhaust memory or create unnecessary garbage accumulation.

Persistent Queries It may be useful in some use cases to persist query state in order to allow a process to resume after a migration or restart. This may require exposing the checkpoint that the query results were fetched from through the DbIterator.