
Prefix Bloom Filters via Pluggable Filter Policies

Status: Accepted

Authors:

This RFC proposes adding prefix bloom filters to SlateDB, enabling prefix scans to skip SSTs that don’t contain matching keys. To support this cleanly, the hard-coded bloom filter is replaced with a pluggable filter policy abstraction where users can supply their own filter implementation (e.g., cuckoo filters, range filters) through traits. The existing bloom filter becomes the default implementation behind this trait, preserving full backwards compatibility.

SlateDB currently uses a bloom filter that hashes full keys. This is effective for point lookups (get) but provides no benefit for prefix scans, which are a critical access pattern for many applications (see issues #1334 and #1302).

Today, a prefix scan must check all sorted runs and any SSTs whose key range could overlap, even though many of those SSTs contain no keys with the target prefix. A prefix-aware filter could eliminate these unnecessary SST reads.

There are many filter designs that could help here — each with different trade-offs. Rather than committing to one, we propose a pluggable filter policy that lets SlateDB ship with a default (the existing bloom filter) and an optional built-in prefix bloom filter, while enabling users to plug in specialized filters without modifying the engine.

  • Define traits that abstract filter construction, serialization, and querying.
  • Refactor the existing bloom filter as the default implementation.
  • Support prefix bloom filters by allowing a user-defined key transformation before hashing.
  • Support future filter implementations without engine changes.
  • Support multiple concurrent filter policies per SST with AND logic evaluation.
  • Enable safe migration from one filter policy to another without losing filtering on existing SSTs.
  • Maintain full backwards compatibility: existing databases with bloom filters continue to work.
  • Support both point-lookup filtering and prefix-scan filtering.
  • Non-goal: shipping additional filter implementations beyond the existing bloom filter and a prefix bloom filter.

The core abstraction is a FilterPolicy trait with associated builder and filter types.

/// A named, configurable filter policy.
pub trait FilterPolicy {
    /// An identifier for this policy. Stored per-SST so the reader knows
    /// whether it can decode and use the filter block.
    ///
    /// The name should encode anything that affects **compatibility** —
    /// i.e., anything that would make a filter unreadable or produce wrong
    /// results if mismatched. For the built-in bloom filter:
    /// - `bits_per_key` is not included — changing it affects quality (FP
    ///   rate) but not the encoding format, so any reader can decode any
    ///   bloom filter regardless of bits_per_key.
    /// - `prefix_extractor` name is included — it changes which hashes are
    ///   stored, so querying with a different extractor produces false
    ///   negatives.
    ///
    /// Examples: `"_bf"`, `"_bf:p=fixed3"`.
    fn name(&self) -> &str;

    /// Creates a new builder for constructing a filter.
    fn builder(&self) -> Box<dyn FilterBuilder>;

    /// Decodes a previously encoded filter.
    ///
    /// The engine matches the policy name stored in the composite filter
    /// block against `self.name()` before calling this to ensure the policy
    /// can deserialize the data.
    fn decode(&self, data: &[u8]) -> Arc<dyn Filter>;

    /// Estimates the encoded size in bytes for a filter with `num_keys` keys.
    ///
    /// This is a hint used by the SST builder to reserve buffer space before
    /// the filter is built. It does not need to be exact.
    ///
    /// For the built-in bloom filter, the actual filter is sized from the
    /// real number of hashes collected during `add_entry`, not from this
    /// estimate, so overestimates waste a small allocation and underestimates
    /// just trigger a reallocation.
    fn estimate_size(&self, num_keys: usize) -> usize;
}

/// Accumulator for entries during SST construction that produces a [Filter].
pub trait FilterBuilder {
    /// Feeds an SST entry to the filter being built.
    ///
    /// The builder receives the full `RowEntry` (key, value, sequence
    /// number, timestamps) so that filter implementations are not limited
    /// to key-only hashing. A bloom filter will typically hash only the
    /// key (or a prefix of it), but other filters — for example a min/max
    /// timestamp filter or a sequence-number range filter — can inspect
    /// whichever fields they need.
    fn add_entry(&mut self, entry: &RowEntry);

    /// Finalizes and returns the completed filter.
    fn build(&mut self) -> Arc<dyn Filter>;
}

/// A read-only filter that answers membership queries.
pub trait Filter {
    /// Returns `true` if the filter cannot rule out the query.
    /// A return value of `false` guarantees no matching key exists.
    ///
    /// Filters that cannot answer a particular query kind should return
    /// `true` to avoid false negatives.
    fn might_match(&self, query: &FilterQuery) -> bool;

    /// Serializes the filter into the provided buffer.
    fn encode(&self, writer: &mut dyn BufMut);

    /// Returns the size of the filter's data in bytes.
    ///
    /// This should reflect the underlying data structure size (e.g., the
    /// bit array length for a bloom filter). Can be used for memory
    /// tracking, cache accounting, etc.
    fn size(&self) -> usize;

    /// Returns a copy with over-allocated memory reclaimed.
    ///
    /// Used by the block cache to reduce memory waste from `Bytes` slices
    /// that reference larger underlying buffers.
    fn clamp_allocated_size(&self) -> Arc<dyn Filter>;
}

/// A membership query passed to [`Filter::might_match`].
pub struct FilterQuery {
    /// The target of this query (a specific key or a prefix).
    pub target: FilterTarget,
    /// Opaque, caller-supplied context for custom filters, forwarded from
    /// `ScanOptions::filter_context` or `ReadOptions::filter_context`.
    ///
    /// Custom filters decode this according to their own byte layout. For
    /// example, a min/max version filter might expect two big-endian `u64`s
    /// packed into the `Inline` variant. The built-in bloom filter ignores
    /// this field entirely.
    pub context: Option<FilterContext>,
}

/// The target of a filter query.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FilterTarget {
    /// Used to test whether a specific key might exist in the SST.
    Point(Bytes),
    /// Used to test whether any key with the given prefix might exist in the SST.
    Prefix(Bytes),
}

/// Caller-supplied opaque context. Carries raw bytes that a custom filter
/// decodes itself.
///
/// Marked `#[non_exhaustive]` so new variants can be added (e.g., a heap
/// `Bytes` variant for larger payloads, or typed variants) as concrete
/// user use cases emerge, without it being a breaking change.
#[derive(Clone, Debug)]
#[non_exhaustive]
pub enum FilterContext {
    /// Inline 64-byte payload with no heap allocation. Suitable for
    /// pairs of `u64`s, `u128`s, and other fixed-layout small structs.
    Inline([u8; 64]),
}

Key design decisions:

  1. FilterQuery with opaque context: FilterQuery carries an optional FilterContext forwarded from ScanOptions / ReadOptions. Custom filters decode it to their expected layout; when it is None or undecodable, they return true. The built-in bloom filter ignores it.

  2. Filter applicability is expressed through might_match returning true: Rather than adding a separate applies_to or can_answer method, filters express inapplicability by returning true from might_match — meaning “I cannot rule this out.” This keeps the trait surface minimal and works uniformly for all filter types:

    • Build time: FilterBuilder::add_entry receives the full RowEntry. Builders that don’t care about certain entries simply ignore them (e.g., a timestamp-range filter ignores entries without timestamps, a prefix bloom filter ignores entries whose key doesn’t match the extractor).
    • Read time (get): might_match receives FilterTarget::Point with the full key. Filters that can’t answer point queries return true.
    • Read time (scan_prefix): might_match receives FilterTarget::Prefix with only the prefix. Filters that can’t answer prefix queries return true. For bloom filters with a PrefixExtractor, the extractor is consulted inside might_match — if the scan prefix has no extractable prefix, the filter returns true rather than risk a false negative.

    This means the engine never needs to know which filters apply to which queries. It always evaluates all filters with AND logic and trusts each filter to return true when it cannot answer.

  3. name() for safety: Each policy’s name is stored alongside its filter data in the composite filter block. When reading an SST, if a stored name doesn’t match any configured policy, that sub-filter is skipped rather than misinterpreted. This allows safe policy migration without manifest-level validation (same pattern as LevelDB’s FilterPolicy::Name()).

  4. FilterTarget enum: Instead of just accepting a hash, the might_match method takes a FilterQuery whose target field distinguishes point lookups and prefix lookups. Filters that only support point lookups can conservatively return true for prefix queries.

  5. encode on Filter, decode on FilterPolicy: Encoding is on the Filter instance because it knows its own internal representation. Decoding is on the FilterPolicy because the caller needs the policy (which knows the format) to reconstruct a Filter from raw bytes.
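As a concrete illustration of decision 2, the following sketch shows a filter that can only answer when the caller supplies bounds, and otherwise returns true ("cannot rule this out"). The types are simplified stand-ins for the RFC's `RowEntry`/`FilterQuery`/`Bytes`, and the `seq_bounds` field is this example's own stand-in for a decoded `FilterContext` payload.

```rust
/// Simplified stand-ins for the RFC's FilterTarget / FilterQuery.
enum Target {
    Point(Vec<u8>),
    Prefix(Vec<u8>),
}

struct Query {
    target: Target,
    /// Caller-supplied (min_seq, max_seq) bounds, standing in for a
    /// decoded FilterContext payload. None means "no context supplied".
    seq_bounds: Option<(u64, u64)>,
}

/// A filter that remembers the min/max sequence number seen in the SST.
struct SeqRangeFilter {
    min_seq: u64,
    max_seq: u64,
}

impl SeqRangeFilter {
    fn might_match(&self, query: &Query) -> bool {
        match query.seq_bounds {
            // No bounds supplied: the filter cannot answer this query.
            // Return true ("cannot rule it out") to avoid false negatives.
            None => true,
            // Bounds supplied: the SST can be ruled out only when its
            // stored range is disjoint from the queried range.
            Some((lo, hi)) => self.min_seq <= hi && self.max_seq >= lo,
        }
    }
}
```

Note that the filter ignores `target` entirely: it applies equally to point and prefix queries, and the engine never needs to know that. Inapplicability is expressed purely through the return value.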

Built-in Implementation: BloomFilterPolicy


A single BloomFilterPolicy handles both full-key and prefix filtering using one bloom filter per SST. Both full-key hashes and prefix hashes are added to the same bit array.

PrefixExtractor is specific to BloomFilterPolicy — it is not part of the core FilterPolicy/FilterBuilder/Filter traits. Custom filter policies that need prefix-based filtering can implement their own selection logic directly in add_entry/might_match.

/// Extractor for a prefix from a byte string, used to build and probe
/// prefix-based bloom filters.
///
/// The extractor's output is always a *prefix* of its input — if
/// `prefix_len(target)` returns `Some(n)`, then the bytes inside `target`
/// sliced to `..n` are the extracted prefix.
///
/// The `target` argument distinguishes two different semantic questions:
///
/// - [`FilterTarget::Point`] — the input is a complete key (a stored key
///   during SST construction, or the target of a point lookup).
///   `prefix_len` returns the extraction length that was / will be hashed
///   into the filter. **Invariant**: if `prefix_len(Point(k)) = Some(n)`,
///   then for every `k'` with `k'[..n] == k[..n]`, also
///   `prefix_len(Point(k')) = Some(n)`. The extraction depends only on
///   the first `n` bytes.
///
/// - [`FilterTarget::Prefix`] — the input is a scan prefix. `prefix_len`
///   returns `Some(n)` only if probing with the first `n` bytes is safe
///   for *every* possible extension of the input. **Invariant**: if
///   `prefix_len(Prefix(p)) = Some(n)`, then for every extension `q` of
///   `p`, `prefix_len(Point(q)) = Some(n)` and `q[..n] == p[..n]`. When
///   this cannot be guaranteed (e.g., the extractor inspects later bytes
///   of the key, as a "last-delimiter" extractor would), the extractor
///   must return `None` for the `Prefix` variant.
///
/// For most extractors — fixed-length, first-delimiter, anchored-prefix —
/// the two variants return the same answer. The split matters for
/// extractors whose extraction depends on the full input; those can still
/// be used for the build and point paths while conservatively disabling
/// prefix-scan filtering.
pub trait PrefixExtractor {
    /// A unique name identifying this extractor's configuration.
    ///
    /// Changing the extractor (e.g. switching from a 4-byte fixed prefix
    /// to a delimiter-based one) changes which hashes are stored in the
    /// filter, so existing filters become invalid. `BloomFilterPolicy`
    /// includes this name in the policy name it writes to SST metadata
    /// (e.g. `"_bf:p=fixed3"`), which lets the reader detect the mismatch
    /// and skip the filter instead of returning wrong results.
    fn name(&self) -> &str;

    /// Returns the length `n` such that the bytes of `target` sliced to
    /// `..n` are the extracted prefix, or `None` if this target has no
    /// extractable prefix under this extractor.
    ///
    /// Called on three paths, distinguished by the variant of `target`:
    /// - **Build time** — policy wraps each stored key in
    ///   `FilterTarget::Point` and hashes `key[..n]` into the filter when
    ///   this returns `Some(n)`.
    /// - **Point reads** — policy forwards the incoming
    ///   `FilterTarget::Point` and probes with `hash(key[..n])`.
    /// - **Prefix reads** — policy forwards the incoming
    ///   `FilterTarget::Prefix` and probes with `hash(prefix[..n])`; a
    ///   `None` result causes the filter to be skipped so no false
    ///   negative can occur.
    ///
    /// **Worked example.** A 3-byte fixed extractor with an SST
    /// containing keys `abc_1`, `abc_2`, `abx_1` stores hashes of `abc`
    /// and `abx`. Then:
    /// - `Prefix("ab")` → `None` (2 < 3; filter skipped).
    /// - `Prefix("abc")` → `Some(3)` → probe `hash("abc")`.
    /// - `Prefix("abcd")` → `Some(3)` → probe `hash("abc")` (truncation
    ///   safe by the `Prefix` invariant).
    /// - `Point("abc_1")` → `Some(3)` → probe `hash("abc")`.
    ///
    /// For a *last-delimiter* extractor (extract up to and including the
    /// last `:`), the `Prefix` variant must return `None` always — the
    /// last delimiter's position can change as bytes are appended, so no
    /// scan prefix is safe to probe. The `Point` variant still returns
    /// the position correctly for complete keys.
    fn prefix_len(&self, target: &FilterTarget) -> Option<usize>;
}
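A fixed-length extractor satisfying this contract might look like the following sketch. Types are simplified to std (`Vec<u8>` in place of `Bytes`, inherent methods in place of the trait), and `FixedPrefixExtractor` is a hypothetical name, not part of this RFC.

```rust
/// Simplified stand-in for the RFC's FilterTarget.
enum FilterTarget {
    Point(Vec<u8>),
    Prefix(Vec<u8>),
}

/// Extracts a fixed-length prefix of `len` bytes.
struct FixedPrefixExtractor {
    len: usize,
}

impl FixedPrefixExtractor {
    fn name(&self) -> String {
        // Encodes the configuration, e.g. "fixed3" for len == 3.
        format!("fixed{}", self.len)
    }

    fn prefix_len(&self, target: &FilterTarget) -> Option<usize> {
        match target {
            // A complete key has an extractable prefix iff it is at least
            // `len` bytes long.
            FilterTarget::Point(k) => (k.len() >= self.len).then_some(self.len),
            // A scan prefix of at least `len` bytes pins the first `len`
            // bytes of every extension, so probing is safe; a shorter scan
            // prefix does not, so return None and let the filter be skipped.
            FilterTarget::Prefix(p) => (p.len() >= self.len).then_some(self.len),
        }
    }
}
```

For this extractor both variants compute the same answer, which matches the observation above that the `Point`/`Prefix` split only matters for extractors whose extraction depends on later bytes of the input.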

The policy would look like:

pub struct BloomFilterPolicy {
    bits_per_key: u32,
    /// When true, full-key hashes are added to the filter. Point lookups
    /// (`get`) probe the filter with the full-key hash.
    /// Default: true.
    whole_key_filtering: bool,
    /// When set, prefix hashes are also added to the filter. Prefix scans
    /// probe the filter with the prefix hash.
    /// Default: None (no prefix filtering).
    prefix_extractor: Option<Arc<dyn PrefixExtractor>>,
    name: String,
}

impl BloomFilterPolicy {
    pub const NAME: &'static str = "_bf";

    pub fn new(bits_per_key: u32) -> Self {
        Self {
            bits_per_key,
            whole_key_filtering: true,
            prefix_extractor: None,
            name: Self::NAME.to_string(),
        }
    }

    /// Configures a prefix extractor for prefix-based bloom filtering.
    ///
    /// When set, the bloom filter hashes both extracted prefixes and (if
    /// `whole_key_filtering` is enabled) full keys. Prefix scans can then
    /// probe the filter to skip SSTs that contain no matching prefixes.
    ///
    /// The extractor's name is included in the policy name to ensure that
    /// filters built with different extractors are not mismatched.
    pub fn with_prefix_extractor(mut self, extractor: Arc<dyn PrefixExtractor>) -> Self {
        self.name = format!("{}:p={}", Self::NAME, extractor.name());
        self.prefix_extractor = Some(extractor);
        self
    }

    /// Controls whether full keys are hashed into the bloom filter.
    ///
    /// When `true` (the default), point lookups can use the filter. Set
    /// to `false` if only prefix scans are needed and you want to reduce
    /// filter size.
    pub fn with_whole_key_filtering(mut self, enabled: bool) -> Self {
        self.whole_key_filtering = enabled;
        self
    }
}

impl FilterPolicy for BloomFilterPolicy {
    fn name(&self) -> &str {
        &self.name
    }

    fn builder(&self) -> Box<dyn FilterBuilder> {
        Box::new(BloomFilterBuilder::new(
            self.bits_per_key,
            self.whole_key_filtering,
            self.prefix_extractor.clone(),
        ))
    }

    fn decode(&self, data: &[u8]) -> Arc<dyn Filter> {
        Arc::new(BloomFilter::decode(
            data,
            self.whole_key_filtering,
            self.prefix_extractor.clone(),
        ))
    }

    fn estimate_size(&self, num_keys: usize) -> usize {
        BloomFilter::estimate_encoded_size(num_keys, self.bits_per_key)
    }
}

The filter_bits_per_key field is removed from Settings. Filter policies are configured on DbBuilder and CompactorBuilder via with_filter_policies, following the same pattern as merge_operator and block_transformer which are also trait-object-based configuration that lives on the builders rather than the serializable Settings struct.

Settings keeps min_filter_keys (a plain u32 that is serializable) since it controls whether a filter is created, not which filter:

pub struct Settings {
    // ... existing fields ...

    /// Write SSTables with a filter if the number of keys in the SSTable
    /// is greater than or equal to this value.
    pub min_filter_keys: u32,

    // filter_bits_per_key is removed — replaced by filter_policies on
    // DbBuilder / CompactorBuilder.
}

DbBuilder and CompactorBuilder gain a with_filter_policies method:

impl<P: Into<Path>> DbBuilder<P> {
    /// Sets the filter policies for this database. Each policy produces a
    /// separate filter per SST, stored in a composite filter block. On
    /// read, all filters are evaluated with AND logic — an SST is skipped
    /// if any filter returns `false`.
    ///
    /// Defaults to `vec![Arc::new(BloomFilterPolicy::new(10))]`.
    /// Pass an empty vec to disable filters.
    pub fn with_filter_policies(
        mut self,
        policies: Vec<Arc<dyn FilterPolicy>>,
    ) -> Self {
        self.filter_policies = policies;
        self
    }
}

impl<P: Into<Path>> CompactorBuilder<P> {
    /// Sets the filter policies used when the compactor rewrites SSTs.
    /// Must match the writer's policies to avoid silently dropping filters.
    pub fn with_filter_policies(
        mut self,
        policies: Vec<Arc<dyn FilterPolicy>>,
    ) -> Self {
        self.filter_policies = policies;
        self
    }
}

Both ScanOptions and ReadOptions get a filter_context field for passing opaque context to custom filters:

pub struct ScanOptions {
    // ... existing fields ...

    /// Opaque context passed to custom filter policies at query time.
    /// Built-in filters (including the default bloom filter) ignore this
    /// field, so leaving it `None` is always safe.
    pub filter_context: Option<FilterContext>,
}

pub struct ReadOptions {
    // ... existing fields ...

    /// Opaque context passed to custom filter policies at query time.
    /// See `ScanOptions::filter_context`.
    pub filter_context: Option<FilterContext>,
}

Because FilterContext is #[non_exhaustive], callers are not locked into one byte encoding:

  • Inline([u8; 64]) covers the common fixed-layout case (e.g., a pair of u64s for a min/max version filter) with zero heap allocation.
  • A heap-allocated Bytes(Bytes) variant can be added later for larger or variable-length payloads without a breaking change.

Configuration modes (for BloomFilterPolicy):

| whole_key_filtering | prefix_extractor | Hashes stored     | get                     | scan_prefix           |
| ------------------- | ---------------- | ----------------- | ----------------------- | --------------------- |
| true (default)      | None             | full-key          | full-key hash           | not filtered          |
| true                | Some(...)        | full-key + prefix | full-key hash (tighter) | extracted prefix hash |
| false               | Some(...)        | prefix only       | extracted prefix hash   | extracted prefix hash |

The third row — prefix-only storage with point lookups also probing the extracted prefix — fits workloads whose keys follow a GroupId ‖ Suffix pattern, where every query (scan or point) is answered by matching on the group id. The filter is sized against the number of distinct group ids (small), and point lookups reuse those same hashes instead of paying to store a full-key hash per row. If the extracted prefix is not in the filter, the full key cannot be in the SST either, so the probe is safe.

Usage:

// Default: full-key bloom filter (backwards compatible, no explicit call needed)
let db = Db::builder("path", object_store).build().await?;

// Full-key + prefix bloom filter (recommended for prefix scan workloads)
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![Arc::new(BloomFilterPolicy::new(10)
        .with_prefix_extractor(Arc::new(MyPrefixExtractor::new())))])
    .build()
    .await?;

// Prefix-only bloom filter. Point lookups probe the filter with the
// extracted prefix of the queried key; scan_prefix probes with the
// extracted prefix of the scan target. No full-key hashes are stored.
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![Arc::new(BloomFilterPolicy::new(10)
        .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))
        .with_whole_key_filtering(false))])
    .build()
    .await?;

// Multiple filters: bloom + custom min/max filter
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![
        Arc::new(BloomFilterPolicy::new(10)
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))),
        Arc::new(MyMinMaxFilterPolicy::new(...)),
    ])
    .build()
    .await?;

// Passing context to custom filters at scan time. For a min/max version
// filter that expects two big-endian u64s, the Inline variant gives a
// no-heap-allocation path even when bounds change per scan:
let mut payload = [0u8; 64];
payload[..8].copy_from_slice(&min_version.to_be_bytes());
payload[8..16].copy_from_slice(&max_version.to_be_bytes());
let scan_options = ScanOptions::default()
    .with_filter_context(Some(FilterContext::Inline(payload)));
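On the filter side, a custom policy would unpack that payload inside might_match. A sketch of the decoding step, assuming the same bytes 0..8 = min / bytes 8..16 = max layout used above (the layout is this example's own convention, not something the engine prescribes):

```rust
/// Decodes the (min, max) version bounds from an Inline context payload.
/// Returns None for an undecodable or nonsensical payload — the caller
/// should treat None as "cannot answer" and have might_match return true.
fn decode_version_bounds(payload: &[u8; 64]) -> Option<(u64, u64)> {
    let min = u64::from_be_bytes(payload[..8].try_into().ok()?);
    let max = u64::from_be_bytes(payload[8..16].try_into().ok()?);
    // Reject inverted ranges rather than answer a malformed query.
    (min <= max).then_some((min, max))
}
```

This mirrors design decision 1: when the context is None or undecodable, the filter falls back to returning true rather than risking a false negative.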

During SST construction, the builder adds both hashes to the same underlying bloom filter. Consecutive keys sharing the same prefix only store the prefix hash once (deduplication), keeping the overhead minimal.
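The deduplication can be sketched as follows. Keys arrive in sorted order during SST construction, so remembering only the previously stored prefix is enough. The sketch collects prefixes instead of adding hashes to a bit array, and the names are hypothetical.

```rust
/// Collects the distinct prefixes of a sorted key stream, standing in
/// for prefix-hash insertion into the bloom filter's bit array.
struct PrefixCollector {
    prefix_len: usize,
    last_prefix: Option<Vec<u8>>,
    stored: Vec<Vec<u8>>, // stands in for hashes added to the bit array
}

impl PrefixCollector {
    fn add_key(&mut self, key: &[u8]) {
        if key.len() < self.prefix_len {
            return; // no extractable prefix; skip this key
        }
        let prefix = &key[..self.prefix_len];
        // Sorted input guarantees equal prefixes are adjacent, so one
        // remembered prefix suffices for deduplication.
        if self.last_prefix.as_deref() != Some(prefix) {
            self.stored.push(prefix.to_vec());
            self.last_prefix = Some(prefix.to_vec());
        }
    }
}
```

With the worked example's keys `abc_1`, `abc_2`, `abx_1` and a 3-byte prefix, only two prefixes (`abc`, `abx`) are stored, which is why the filter is sized against distinct prefixes rather than rows.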

The same filter is probed with different hashes depending on the query type:

impl Filter for BloomFilter {
    fn might_match(&self, query: &FilterQuery) -> bool {
        match &query.target {
            FilterTarget::Point(key) if self.whole_key_filtering => {
                // Full-key hash gives the tightest answer when available.
                self.might_contain(filter_hash(key.as_ref()))
            }
            target => {
                // Otherwise defer to the extractor. For `Point`, this is
                // the fallback when whole-key filtering is disabled —
                // we probe with the extracted prefix of the queried key.
                // For `Prefix`, the extractor answers whether the scan
                // prefix is safe to probe (returning `None` when the
                // extractor's extraction depends on bytes beyond the
                // scan prefix, e.g., a last-delimiter extractor).
                let Some(ref extractor) = self.prefix_extractor else {
                    return true;
                };
                let Some(n) = extractor.prefix_len(target) else {
                    return true;
                };
                let bytes = match target {
                    FilterTarget::Point(k) => k.as_ref(),
                    FilterTarget::Prefix(p) => p.as_ref(),
                };
                self.might_contain(filter_hash(&bytes[..n]))
            }
        }
    }
}

The SST gains a filter_format enum field to distinguish the new composite block format from the legacy single-bloom format:

enum FilterFormat: byte {
    // Single raw bloom filter bytes (pre-composite format).
    Legacy,
    // Block containing one or more named filters, each prefixed with its
    // name and length.
    Composite
}

table SsTableInfo {
    // ... existing fields (filter_offset, filter_len, etc.) ...

    // Filter block format. Defaults to Legacy for backwards compatibility.
    filter_format: FilterFormat;
}

When filter_format is Composite, the filter block at filter_offset/filter_len contains a list of named filters:

[num_filters: u16]
[name_len: u16][name: bytes][data_len: u64][data: bytes] // filter 0
[name_len: u16][name: bytes][data_len: u64][data: bytes] // filter 1
...

Each sub-filter self-identifies by its policy’s name(). Even with a single policy, the block uses this format (a list of one).
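To make the layout concrete, here is a self-contained round-trip sketch of the composite encoding. The layout above does not specify integer endianness; little-endian is assumed here purely for illustration.

```rust
/// Encodes (name, data) pairs into the composite filter block layout:
/// [num_filters: u16] then per filter [name_len: u16][name][data_len: u64][data].
fn encode_composite(filters: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(filters.len() as u16).to_le_bytes());
    for (name, data) in filters {
        out.extend_from_slice(&(name.len() as u16).to_le_bytes());
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(&(data.len() as u64).to_le_bytes());
        out.extend_from_slice(data);
    }
    out
}

/// Decodes a composite filter block back into (name, data) pairs.
fn decode_composite(buf: &[u8]) -> Vec<(String, Vec<u8>)> {
    // Reads the next `n` bytes and advances the cursor.
    fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> &'a [u8] {
        let s = &buf[*pos..*pos + n];
        *pos += n;
        s
    }
    let mut pos = 0;
    let num = u16::from_le_bytes(take(buf, &mut pos, 2).try_into().unwrap());
    let mut filters = Vec::new();
    for _ in 0..num {
        let name_len = u16::from_le_bytes(take(buf, &mut pos, 2).try_into().unwrap()) as usize;
        let name = String::from_utf8(take(buf, &mut pos, name_len).to_vec()).unwrap();
        let data_len = u64::from_le_bytes(take(buf, &mut pos, 8).try_into().unwrap()) as usize;
        let data = take(buf, &mut pos, data_len).to_vec();
        filters.push((name, data));
    }
    filters
}
```

A reader would then look up each decoded name among the configured policies and call policy.decode(data) on matches, skipping unrecognized names.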

When reading an SST’s filter block:

  • FilterFormat::Legacy: The filter block contains raw bloom filter bytes. This is the FlatBuffer default, so SSTs written by older versions are automatically treated as Legacy. The bytes are decoded via the configured BloomFilterPolicy. If no bloom policy is configured, the filter is skipped.
  • FilterFormat::Composite: The filter block is a composite containing one or more named sub-filters. Each (name, data) entry is matched against the configured filter_policies by name() and decoded via policy.decode(data). Unrecognized names are skipped.

The SST builder receives the filter_policies array. For each policy, it calls builder() to obtain a FilterBuilder. Each key is fed to all builders via add_entry(). On finalization, each builder produces a filter via build(), and all filters are encoded into a composite filter block. The SST info’s filter_format is set to Composite.

The current read path uses filter_hash() and BloomFilter::might_contain() directly. We replace this with the FilterQuery abstraction and AND-logic evaluation across all filters.

Point lookups (get):

The BloomFilterEvaluator in sst_iter.rs currently hashes the key and calls might_contain. With the pluggable design:

// Before:
let key_hash = filter::filter_hash(self.key.as_ref());
filter.might_contain(key_hash)
// After:
let query = FilterQuery::point(self.key.clone());
// AND logic: skip SST if ANY filter returns false
filters.iter().all(|f| f.might_match(&query))

Prefix scans:

SlateDB already provides scan_prefix and scan_prefix_with_options methods that accept a prefix. Internally, the prefix will be wired through the reader and iterator chain so each SST’s filters can be checked before opening it, skipping SSTs where any filter returns false.

When a prefix_extractor is configured on the BloomFilterPolicy, prefix scans probe the bloom filter with filter_hash(scan_prefix[..n]) where n = extractor.prefix_len(FilterTarget::Prefix(scan_prefix)). The extractor’s Prefix-variant contract guarantees that this probe is safe for every extension of the scan prefix; when the extractor cannot offer that guarantee (e.g., a last-delimiter extractor), it returns None and the filter is skipped. The default configuration (no prefix_extractor) also returns true for prefix queries, so no filtering. Other filters in the array may still reject the SST based on context or other criteria.

With whole_key_filtering = false and a prefix_extractor configured, point lookups (get) also benefit from the filter without any additional storage: might_match(Point(k)) probes with hash(k[..extractor.prefix_len(FilterTarget::Point(k))]). This fits GroupId ‖ Suffix key schemas: the extractor matches the group id and hashing the suffix contributes no extra pruning power, since any point query already knows which group it targets.

Note: prefix filtering alone is not ideal for recency access patterns where the caller only needs a recent entry for a prefix. See Recency-Based Iterator in Future Enhancements.

SlateDB features and components that this RFC interacts with. Check all that apply.

  • Basic KV API (get/put/delete)
  • Range queries, iterators, seek semantics
  • Range deletions
  • Error model, API errors

Consistency, Isolation, and Multi-Versioning

  • Transactions
  • Snapshots
  • Sequence numbers
  • Time to live (TTL)
  • Compaction filters
  • Merge operator
  • Change Data Capture (CDC)
  • Manifest format
  • Checkpoints
  • Clones
  • Garbage collection
  • Database splitting and merging
  • Multi-writer
  • Compaction state persistence
  • Compaction filters
  • Compaction strategies
  • Distributed compaction
  • Compactions format
  • Write-ahead log (WAL)
  • Block cache
  • Object store cache
  • Indexing (bloom filters, metadata)
  • SST format or block format
  • CLI tools
  • Language bindings (Go/Python/etc)
  • Observability (metrics/logging/tracing)
  • Point lookup latency: Negligible change. The dynamic dispatch overhead of dyn Filter is a single vtable lookup per SST which shouldn’t be a bottleneck compared to I/O costs, so the default bloom filter performs identically to today.
  • Prefix scan latency: Significant improvement when a prefix-aware filter is configured. SSTs that don’t contain the target prefix are skipped entirely, avoiding block reads.
  • Write throughput: No meaningful change. Filter construction cost is comparable.
  • Space: No change for the default bloom filter (no prefix_extractor). With a prefix_extractor, the filter grows slightly to accommodate prefix hashes, but deduplication of consecutive same-prefix hashes keeps the overhead minimal. Much less overhead than maintaining two separate filters.
  • Configuration: New with_filter_policies on DbBuilder / CompactorBuilder (programmatic only; not serializable to TOML/JSON). filter_bits_per_key has been removed from Settings.
  • Metrics: Existing sst_filter_positive_count, sst_filter_negative_count, sst_filter_false_positive_count counters are preserved. Each gains a kind label with values point or prefix so point-lookup filtering and prefix-scan filtering can be tracked independently (mirrors RocksDB’s separation of BLOOM_FILTER_FULL_* and BLOOM_FILTER_PREFIX_* tickers).
  • On-disk format: The filter block encoding is policy-specific but the SST envelope is unchanged. A new filter_format enum field is added to SST info. Old SSTs without this field default to Legacy (FlatBuffers absent-field semantics map to 0). See SST Format Changes.
  • Public API: filter_bits_per_key is removed from Settings. DbBuilder and CompactorBuilder gain with_filter_policies. See Configuration for migration examples. ScanOptions and ReadOptions gain an optional filter_context: Option<FilterContext> field; built-in filters ignore it, so existing callers need no changes.
  • Rolling upgrades: Old readers that don’t understand the filter_format field will ignore it (FlatBuffers forward compatibility). They will continue to decode filters as bloom filters, which is correct as long as the bloom filter policy is still in use.
  • Prefix extractor changes: Changing the prefix extractor changes the policy name (e.g., "_bf:p=fixed3""_bf:p=delim:"). With the array design, the old policy can remain in filter_policies alongside the new one, so existing SSTs’ filters remain usable while new SSTs are written with both. After compaction rewrites all SSTs, the old policy can be removed.
  • Compactor policy: If the compactor runs in a separate process from the writer (e.g., distributed compaction), it must be configured with the same filter_policies. Otherwise, compacted SSTs will be written with different (or no) filter policies, and existing filters may be silently dropped during compaction.
  • Unit tests: Filter policy correctness (no false negatives, expected false positive rates) for each built-in implementation.
  • Integration tests: Prefix scan SST skipping, mismatched filter policy graceful degradation, backwards compatibility with old SSTs.
  • Performance tests: Benchmark prefix scans with and without prefix bloom filters; the benefit should be clearly measurable when a prefix bloom filter is configured.

Implementation will be in two phases:

  1. Pluggable filter abstraction: Trait definitions (FilterPolicy, FilterBuilder, Filter, PrefixExtractor), refactor the existing bloom filter as the default BloomFilterPolicy, SST format changes (filter_format enum, composite filter block), and refactoring the write/read paths to use the pluggable filter policies with AND-logic evaluation. Breaking config change: filter_bits_per_key removed from Settings; with_filter_policies added to DbBuilder / CompactorBuilder.
  2. Prefix bloom filter: Add prefix_extractor and whole_key_filtering to BloomFilterPolicy. Wire the prefix through the read path so each SST’s filters can be probed with FilterQuery::Prefix before opening. Skip SSTs where any filter rejects the prefix.

Prefix bloom filters skip SSTs that don’t contain matching keys, but the merge-based scan still reads from every matching SST and merges them to produce global sort order. For workloads where the caller only needs the most recent entry for a prefix (e.g., latest version of a versioned key, most recent reading from a sensor), this is wasteful — the answer is almost always in the newest SST.

A recency-based iterator would return entries newest-first (most recent SST first, then next most recent) without merging, similar to how get() walks sources in recency order and returns the first match. Combined with prefix bloom filters, the iterator would skip SSTs with no matching prefix, read matching SSTs in recency order, and stop early once the caller has enough results.

This RFC’s filters operate at SST granularity; the reader either skips an entire SST or reads it. This is appropriate for bloom filters (which need many keys for good false-positive rates) but too coarse for metadata-style filters.

For example, a min/max version filter on an SST whose blocks span versions [1,50], [51,100], and [101,150] can only report the SST-level range [1,150]. A query for version ≤ 110 must read the entire SST even though only the first block is relevant. With per-block metadata, the remaining blocks would be skipped without loading them from storage.

The FilterPolicy, FilterBuilder, and Filter traits defined in this RFC are sufficient to support block-level granularity. The core change is configuration: each policy would declare whether it applies at SST level, block level, or both. On the write path, block-level builders are finalized and reset per block in finish_block() instead of once at SST completion. On the read path, per-block filter data stored in BlockMeta is checked before loading a data block, using the same might_match interface.
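The read-path gain from block-level granularity can be illustrated with a toy min/max version filter. `BlockMeta` and `blocks_to_read` are hypothetical names; the point is only that overlap is tested per block instead of once per SST.

```rust
/// Hypothetical per-block metadata carrying a min/max version range,
/// as it might be stored alongside each block's index entry.
struct BlockMeta {
    min_version: u64,
    max_version: u64,
}

/// Block-level filtering: only blocks whose version range overlaps the
/// query range [query_min, query_max] need to be loaded. An SST-level
/// filter would collapse all blocks into one range and lose this precision.
fn blocks_to_read(metas: &[BlockMeta], query_min: u64, query_max: u64) -> Vec<usize> {
    metas
        .iter()
        .enumerate()
        .filter(|(_, m)| m.min_version <= query_max && m.max_version >= query_min)
        .map(|(i, _)| i)
        .collect()
}
```

With blocks spanning [1,50], [51,100], and [101,150], a query for version ≥ 110 selects only the last block, whereas the SST-level range [1,150] would force reading all three.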

The current BloomFilterPolicy applies to all keys in an SST. Some workloads need different filter strategies for different key subsets — e.g., a full-key bloom for keys starting with user:: and a prefix bloom for keys starting with post::. With the current design, this requires implementing a custom FilterPolicy that hard-codes the selection logic in add_entry and might_match.

A KeySelector trait would make this composable without custom policies:

pub trait KeySelector: Send + Sync {
    fn name(&self) -> &str;
    fn includes(&self, key: &[u8]) -> bool;
}

BloomFilterPolicy would gain an optional with_key_selector(Arc<dyn KeySelector>) method. On the write path, KeySelector::includes gates entry inclusion — if false, the entry is skipped entirely (no full-key hash, no prefix hash). On the read path for Point queries, keys the selector excludes return true, since an inapplicable filter cannot prove absence.

Prefix scan still relies on the configured PrefixExtractor, which operates on the extracted prefix of each selected key.
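The `StartsWithSelector` used in the example configuration is not defined in this RFC; a minimal sketch of such a selector might look as follows (repeating the `KeySelector` trait so the example is self-contained — all of this is illustrative, not a committed API):

```rust
pub trait KeySelector: Send + Sync {
    fn name(&self) -> &str;
    fn includes(&self, key: &[u8]) -> bool;
}

/// Hypothetical selector: includes only keys that start with a fixed
/// byte prefix, e.g. "user::" or "post::".
pub struct StartsWithSelector {
    prefix: Vec<u8>,
}

impl StartsWithSelector {
    pub fn new(prefix: &str) -> Self {
        Self { prefix: prefix.as_bytes().to_vec() }
    }
}

impl KeySelector for StartsWithSelector {
    fn name(&self) -> &str {
        "starts_with"
    }

    fn includes(&self, key: &[u8]) -> bool {
        key.starts_with(&self.prefix)
    }
}
```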

Example configuration:

let db = Db::builder("path", object_store)
    .with_filter_policies(vec![
        // Full-key bloom, only for "user::" keys
        Arc::new(BloomFilterPolicy::new(10)
            .with_key_selector(Arc::new(StartsWithSelector::new("user::")))),
        // Prefix bloom, only for "post::" keys
        Arc::new(BloomFilterPolicy::new(10)
            .with_key_selector(Arc::new(StartsWithSelector::new("post::")))
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))
            .with_whole_key_filtering(false)),
    ])
    .build().await?;

KeySelector would be BloomFilterPolicy-specific, not part of the core FilterPolicy/FilterBuilder/Filter traits. Custom filter policies can implement their own selection logic directly in add_entry/might_match.

1. Separate filter structures per SST: Maintain two separate bloom filters per SST (one for full keys, one for prefixes) and route queries to the appropriate filter. This is conceptually clean but wastes space: two separate bit arrays have higher overhead than a single shared one. Rejected in favor of the single-filter approach.

2. Ship DIVA or SuRF as a built-in filter: The optimal filter design is workload-dependent. Rather than picking one, the pluggable trait lets users implement their own policy. Rejected as a built-in; users can integrate these as plugins.

3. Status quo (no prefix filtering): Prefix scans continue to read all overlapping SSTs. This is viable for small databases but becomes a bottleneck for large datasets with many sorted runs, as described in issues #1334 and #1302. Rejected because the performance impact is significant and the fix is well-understood.

Should scan / scan_with_options also support prefix filtering?

Resolved: Option A — No inference. Prefix filtering only via scan_prefix / scan_prefix_with_options. scan / scan_with_options never use prefix filters. This keeps behavior explicit and avoids ambiguity about when filtering is active.

Should filter_policies be an array or a single policy?

Resolved: Array. Both reviewers approved the array approach. The array enables migration (old policy stays alongside the new one), multi-filter composition (e.g., bloom + min/max), and avoids a future breaking API change. A “compound filter” that looks like a single filter to SlateDB but internally contains multiple sub-filters (e.g., filter_2_start_idx, filter_1_bytes, filter_2_bytes) was also discussed as a complementary approach for cases like different filter strategies per key subset — to be explored further in code.

This RFC only adds prefix-based SST filtering. Range scan filtering (skipping SSTs based on arbitrary range bounds) is not included but can be added with the current design — a FilterQuery::Range variant and a range-aware filter policy would slot into the existing might_match dispatch without structural changes. The read path would need to pass the full scan range bounds to the filter instead of just a prefix.
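A `FilterQuery::Range` variant could slot in roughly as sketched below. This is a speculative shape, not part of this RFC: the variant name, the use of `std::ops::Bound`, and the toy `KeyRangeFilter` are all assumptions for illustration.

```rust
use std::ops::Bound;

/// FilterQuery extended with a hypothetical Range variant; Point and
/// Prefix are repeated here for self-containment. The bounds mirror a
/// scan's start/end keys.
enum FilterQuery<'a> {
    Point(&'a [u8]),
    Prefix(&'a [u8]),
    Range(Bound<&'a [u8]>, Bound<&'a [u8]>),
}

/// Toy range-aware filter that only knows an SST's min/max key. A real
/// range filter (e.g., SuRF-style) could also prove absence for gaps
/// strictly inside [min_key, max_key].
struct KeyRangeFilter {
    min_key: Vec<u8>,
    max_key: Vec<u8>,
}

impl KeyRangeFilter {
    fn might_match(&self, query: &FilterQuery) -> bool {
        match query {
            FilterQuery::Point(k) => {
                self.min_key.as_slice() <= *k && *k <= self.max_key.as_slice()
            }
            // Conservative: prefix probes are out of scope for this toy filter.
            FilterQuery::Prefix(_) => true,
            FilterQuery::Range(start, end) => {
                // Overlap test: the scan must start at or before max_key
                // and end at or after min_key.
                let starts_ok = match start {
                    Bound::Included(s) => *s <= self.max_key.as_slice(),
                    Bound::Excluded(s) => *s < self.max_key.as_slice(),
                    Bound::Unbounded => true,
                };
                let ends_ok = match end {
                    Bound::Included(e) => self.min_key.as_slice() <= *e,
                    Bound::Excluded(e) => self.min_key.as_slice() < *e,
                    Bound::Unbounded => true,
                };
                starts_ok && ends_ok
            }
        }
    }
}
```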

  • 2026-04-21 — Reworked PrefixExtractor to collapse in_domain and prefix_len into a single method prefix_len(&FilterTarget) -> Option<usize>. The FilterTarget argument lets the extractor distinguish a complete key (Point) from a scan prefix (Prefix), so extractors that inspect bytes beyond the scan prefix (e.g., last-delimiter) can return None for Prefix while still extracting for Point. As a result, might_match now filters point lookups via the extracted prefix when whole_key_filtering = false, and prefix scans may truncate over-length scan prefixes and probe.
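The collapsed `prefix_len(&FilterTarget) -> Option<usize>` shape can be illustrated with a last-delimiter extractor, the example the changelog entry cites. The type and trait definitions below are simplified stand-ins for the real API:

```rust
/// Simplified stand-in for the FilterTarget passed to the extractor.
enum FilterTarget<'a> {
    /// A complete key (an entry being added, or a point lookup).
    Point(&'a [u8]),
    /// A scan prefix, possibly shorter than any real key.
    Prefix(&'a [u8]),
}

trait PrefixExtractor: Send + Sync {
    /// Some(n): the first n bytes of the target are the prefix.
    /// None: the target is out of the extractor's domain.
    fn prefix_len(&self, target: &FilterTarget) -> Option<usize>;
}

/// Extracts everything up to and including the LAST delimiter byte.
/// For a scan prefix this is unsafe -- the real key may contain further
/// delimiters beyond the scan prefix's end -- so Prefix targets return
/// None, while complete keys (Point) still extract.
struct LastDelimiter(u8);

impl PrefixExtractor for LastDelimiter {
    fn prefix_len(&self, target: &FilterTarget) -> Option<usize> {
        match target {
            FilterTarget::Point(key) => {
                key.iter().rposition(|&b| b == self.0).map(|i| i + 1)
            }
            FilterTarget::Prefix(_) => None,
        }
    }
}
```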