# Prefix Bloom Filters via Pluggable Filter Policies

Status: Accepted

Authors:

* [Hussein Nomier](https://github.com/nomiero)

## Summary

This RFC proposes adding prefix bloom filters to SlateDB, enabling prefix
scans to skip SSTs that don't contain matching keys. To support this cleanly,
the hard-coded bloom filter is replaced with a pluggable filter policy
abstraction where users can supply their own filter implementation (e.g., cuckoo
filters, range filters) through traits. The existing bloom filter becomes the
default implementation behind this trait, preserving full backwards
compatibility.

## Motivation

SlateDB currently uses a bloom filter that hashes full keys. This is effective for
point lookups (`get`) but provides no benefit for prefix scans, which are a
critical access pattern for many applications (see issue
[#1334](https://github.com/slatedb/slatedb/issues/1334) and
[#1302](https://github.com/slatedb/slatedb/issues/1302)).

Today, a prefix scan must check all sorted runs and any SSTs whose key range
could overlap, even though many of those SSTs contain no keys with the target
prefix. A prefix-aware filter could eliminate these unnecessary SST reads.

There are many filter designs that could help here — each with different
trade-offs. Rather than committing to one, we propose a pluggable
filter policy that lets SlateDB ship with a default (the existing bloom filter)
and an optional built-in prefix bloom filter, while enabling users to plug in
specialized filters without modifying the engine.

## Goals

- Define traits that abstract filter construction, serialization, and querying.
- Refactor the existing bloom filter as the default implementation.
- Support prefix bloom filters by allowing a user-defined key transformation before hashing.
- Support future filter implementations without engine changes.
- Support multiple concurrent filter policies per SST with AND logic evaluation.
- Enable safe migration from one filter policy to another without losing filtering on existing SSTs.
- Maintain full backwards compatibility: existing databases with bloom filters continue to work.
- Support both point-lookup filtering and prefix-scan.

## Non-Goals

- Shipping additional filter implementations beyond the existing bloom filter and a prefix bloom filter.

## Design

### Traits

The core abstraction is a `FilterPolicy` trait with associated builder and
filter types.

```rust
/// A named, configurable filter policy.
pub trait FilterPolicy {
    /// An identifier for this policy. Stored per-SST so the reader knows
    /// whether it can decode and use the filter block.
    ///
    /// The name should encode anything that affects **compatibility** —
    /// i.e., anything that would make a filter unreadable or produce wrong
    /// results if mismatched. For the built-in bloom filter:
    /// - `bits_per_key` is not included — changing it affects quality (FP
    ///   rate) but not the encoding format, so any reader can decode any
    ///   bloom filter regardless of bits_per_key.
    /// - `prefix_extractor` name is included — it changes which hashes are
    ///   stored, so querying with a different extractor produces false
    ///   negatives.
    ///
    /// Examples: `"_bf"`, `"_bf:p=fixed3"`.
    fn name(&self) -> &str;

    /// Creates a new builder for constructing a filter.
    fn builder(&self) -> Box<dyn FilterBuilder>;

    /// Decodes a previously encoded filter.
    ///
    /// The engine matches the policy name stored in the composite filter
    /// block against `self.name()` before calling this to ensure the policy
    /// can deserialize the data.
    fn decode(&self, data: &[u8]) -> Arc<dyn Filter>;

    /// Estimates the encoded size in bytes for a filter with `num_keys` keys.
    ///
    /// This is a hint used by the SST builder to reserve buffer space before
    /// the filter is built. It does not need to be exact.
    ///
    /// For the built-in bloom filter, the actual filter is sized from the
    /// real number of hashes collected during `add_entry`, not from this estimate,
    /// so overestimates waste a small allocation and underestimates just trigger
    /// a reallocation.
    fn estimate_size(&self, num_keys: usize) -> usize;
}

/// Accumulator for entries during SST construction that produces a [Filter].
pub trait FilterBuilder {
    /// Feeds an SST entry to the filter being built.
    ///
    /// The builder receives the full `RowEntry` (key, value, sequence
    /// number, timestamps) so that filter implementations are not limited
    /// to key-only hashing.  A bloom filter will typically hash only the
    /// key (or a prefix of it), but other filters — for example a min/max
    /// timestamp filter or a sequence-number range filter — can inspect
    /// whichever fields they need.
    fn add_entry(&mut self, entry: &RowEntry);

    /// Finalizes and returns the completed filter.
    fn build(&mut self) -> Arc<dyn Filter>;
}

/// A read-only filter that answers membership queries.
pub trait Filter {
    /// Returns `true` if the filter cannot rule out the query.
    /// A return value of `false` guarantees no matching key exists.
    ///
    /// Filters that cannot answer a particular query kind should return
    /// `true` to avoid false negatives.
    fn might_match(&self, query: &FilterQuery) -> bool;

    /// Serializes the filter into the provided buffer.
    fn encode(&self, writer: &mut dyn BufMut);

    /// Returns the size of the filter's data in bytes.
    ///
    /// This should reflect the underlying data structure size (e.g., the
    /// bit array length for a bloom filter). Can be used for memory
    /// tracking, cache accounting, etc.
    fn size(&self) -> usize;

    /// Returns a copy with over-allocated memory reclaimed.
    ///
    /// Used by the block cache to reduce memory waste from `Bytes` slices
    /// that reference larger underlying buffers.
    fn clamp_allocated_size(&self) -> Arc<dyn Filter>;
}

/// A membership query passed to [`Filter::might_match`].
pub struct FilterQuery {
    /// The target of this query (a specific key or a prefix).
    pub target: FilterTarget,
    /// Opaque, caller-supplied context for custom filters, forwarded from
    /// `ScanOptions::filter_context` or `ReadOptions::filter_context`.
    ///
    /// Custom filters decode this according to their own byte layout. For
    /// example, a min/max version filter might expect two big-endian `u64`s
    /// packed into the `Inline` variant. The built-in bloom filter ignores
    /// this field entirely.
    pub context: Option,
}

/// The target of a filter query.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FilterTarget {
    /// Used to test whether a specific key might exist in the SST.
    Point(Bytes),
    /// Used to test whether any key with the given prefix might exist in the SST.
    Prefix(Bytes),
}

/// Caller-supplied opaque context. Carries raw bytes that a custom filter
/// decodes itself.
///
/// Marked `#[non_exhaustive]` so new variants can be added (e.g., a heap
/// `Bytes` variant for larger payloads, or typed variants) as concrete
/// user use cases emerge, without it being a breaking change.
#[derive(Clone, Debug)]
#[non_exhaustive]
pub enum FilterContext {
    /// Inline 64-byte payload with no heap allocation. Suitable for
    /// pairs of `u64`s, `u128`s, and other fixed-layout small structs.
    Inline([u8; 64]),
}
```

Key design decisions:

1. **`FilterQuery` with opaque context**: `FilterQuery` carries an optional
   `FilterContext` forwarded from `ScanOptions` / `ReadOptions`. Custom filters
   decode it to their expected layout; when it is `None` or undecodable, they
   return `true`. The built-in bloom filter ignores it.

2. **Filter applicability is expressed through `might_match` returning `true`**:
   Rather than adding a separate `applies_to` or `can_answer` method, filters
   express inapplicability by returning `true` from `might_match` — meaning
   "I cannot rule this out." This keeps the trait surface minimal and works
   uniformly for all filter types:

   - **Build time**: `FilterBuilder::add_entry` receives the full `RowEntry`.
     Builders that don't care about certain entries simply ignore them (e.g.,
     a timestamp-range filter ignores entries without timestamps, a prefix
     bloom filter ignores entries whose key doesn't match the extractor).
   - **Read time (`get`)**: `might_match` receives `FilterTarget::Point`
     with the full key. Filters that can't answer point queries return `true`.
   - **Read time (`scan_prefix`)**: `might_match` receives
     `FilterTarget::Prefix` with only the prefix. Filters that can't answer
     prefix queries return `true`. For bloom filters with a `PrefixExtractor`,
     the extractor is consulted *inside* `might_match` — if the scan prefix
     has no extractable prefix, the filter returns `true` rather than risk a
     false negative.

   This means the engine never needs to know which filters apply to which
   queries. It always evaluates all filters with AND logic and trusts each
   filter to return `true` when it cannot answer.

3. **`name()` for safety**: Each policy's name is stored alongside its filter
   data in the composite filter block. When reading an SST, if a stored name
   doesn't match any configured policy, that sub-filter is skipped rather than
   misinterpreted. This allows safe policy migration without manifest-level
   validation (same pattern as LevelDB's `FilterPolicy::Name()`).

4. **`FilterTarget` enum**: Instead of just accepting a hash, the `might_match`
   method takes a `FilterQuery` whose `target` field distinguishes point lookups
   and prefix lookups. Filters that only support point lookups can conservatively
   return `true` for prefix queries.

5. **`encode` on `Filter`, `decode` on `FilterPolicy`**: Encoding is on the
   `Filter` instance because it knows its own internal representation.
   Decoding is on the `FilterPolicy` because the caller needs the policy
   (which knows the format) to reconstruct a `Filter` from raw bytes.

### Built-in Implementation: `BloomFilterPolicy`

A single `BloomFilterPolicy` handles both full-key and prefix filtering using
one bloom filter per SST. Both full-key hashes and prefix hashes are added to
the same bit array.

#### `PrefixExtractor`

`PrefixExtractor` is specific to `BloomFilterPolicy` — it is not part of the
core `FilterPolicy`/`FilterBuilder`/`Filter` traits. Custom filter policies
that need prefix-based filtering can implement their own selection logic
directly in `add_entry`/`might_match`.

```rust
/// Extractor for a prefix from a byte string, used to build and probe
/// prefix-based bloom filters.
///
/// The extractor's output is always a *prefix* of its input — if
/// `prefix_len(target)` returns `Some(n)`, then the bytes inside `target`
/// sliced to `..n` are the extracted prefix.
///
/// The `target` argument distinguishes two different semantic questions:
///
/// - [`FilterTarget::Point`] — the input is a complete key (a stored key
///   during SST construction, or the target of a point lookup).
///   `prefix_len` returns the extraction length that was / will be hashed
///   into the filter. **Invariant**: if `prefix_len(Point(k)) = Some(n)`,
///   then for every `k'` with `k'[..n] == k[..n]`, also
///   `prefix_len(Point(k')) = Some(n)`. The extraction depends only on
///   the first `n` bytes.
///
/// - [`FilterTarget::Prefix`] — the input is a scan prefix. `prefix_len`
///   returns `Some(n)` only if probing with the first `n` bytes is safe
///   for *every* possible extension of the input. **Invariant**: if
///   `prefix_len(Prefix(p)) = Some(n)`, then for every extension `q` of
///   `p`, `prefix_len(Point(q)) = Some(n)` and `q[..n] == p[..n]`. When
///   this cannot be guaranteed (e.g., the extractor inspects later bytes
///   of the key, as a "last-delimiter" extractor would), the extractor
///   must return `None` for the `Prefix` variant.
///
/// For most extractors — fixed-length, first-delimiter, anchored-prefix —
/// the two variants return the same answer. The split matters for
/// extractors whose extraction depends on the full input; those can still
/// be used for the build and point paths while conservatively disabling
/// prefix-scan filtering.
pub trait PrefixExtractor {
    /// A unique name identifying this extractor's configuration.
    ///
    /// Changing the extractor (e.g. switching from a 4-byte fixed prefix
    /// to a delimiter-based one) changes which hashes are stored in the
    /// filter, so existing filters become invalid. `BloomFilterPolicy`
    /// includes this name in the policy name it writes to SST metadata
    /// (e.g. `"_bf:p=fixed3"`), which lets the reader
    /// detect the mismatch and skip the filter instead of returning wrong
    /// results.
    fn name(&self) -> &str;

    /// Returns the length `n` such that the bytes of `target` sliced to
    /// `..n` are the extracted prefix, or `None` if this target has no
    /// extractable prefix under this extractor.
    ///
    /// Called on three paths, distinguished by the variant of `target`:
    /// - **Build time** — policy wraps each stored key in
    ///   `FilterTarget::Point` and hashes `key[..n]` into the filter when
    ///   this returns `Some(n)`.
    /// - **Point reads** — policy forwards the incoming
    ///   `FilterTarget::Point` and probes with `hash(key[..n])`.
    /// - **Prefix reads** — policy forwards the incoming
    ///   `FilterTarget::Prefix` and probes with `hash(prefix[..n])`; a
    ///   `None` result causes the filter to be skipped so no false
    ///   negative can occur.
    ///
    /// **Worked example.** A 3-byte fixed extractor with an SST
    /// containing keys `abc_1`, `abc_2`, `abx_1` stores hashes of `abc`
    /// and `abx`. Then:
    /// - `Prefix("ab")` → `None` (2 < 3; filter skipped).
    /// - `Prefix("abc")` → `Some(3)` → probe `hash("abc")`.
    /// - `Prefix("abcd")` → `Some(3)` → probe `hash("abc")` (truncation
    ///   safe by the `Prefix` invariant).
    /// - `Point("abc_1")` → `Some(3)` → probe `hash("abc")`.
    ///
    /// For a *last-delimiter* extractor (extract up to and including the
    /// last `:`), the `Prefix` variant must return `None` always — the
    /// last delimiter's position can change as bytes are appended, so no
    /// scan prefix is safe to probe. The `Point` variant still returns
    /// the position correctly for complete keys.
    fn prefix_len(&self, target: &FilterTarget) -> Option<usize>;
}
```

The policy would look like:
```rust
pub struct BloomFilterPolicy {
    bits_per_key: u32,

    /// When true, full-key hashes are added to the filter. Point lookups
    /// (`get`) probe the filter with the full-key hash.
    /// Default: true.
    whole_key_filtering: bool,

    /// When set, prefix hashes are also added to the filter. Prefix scans
    /// probe the filter with the prefix hash.
    /// Default: None (no prefix filtering).
    prefix_extractor: Option>,

    name: String,
}

impl BloomFilterPolicy {
    pub const NAME: &'static str = "_bf";

    pub fn new(bits_per_key: u32) -> Self {
        Self {
            bits_per_key,
            whole_key_filtering: true,
            prefix_extractor: None,
            name: Self::NAME.to_string(),
        }
    }

    /// Configures a prefix extractor for prefix-based bloom filtering.
    ///
    /// When set, the bloom filter hashes both extracted prefixes and (if
    /// `whole_key_filtering` is enabled) full keys. Prefix scans can then
    /// probe the filter to skip SSTs that contain no matching prefixes.
    ///
    /// The extractor's name is included in the policy name to ensure that
    /// filters built with different extractors are not mismatched.
    pub fn with_prefix_extractor(mut self, extractor: Arc<dyn PrefixExtractor>) -> Self {
        self.name = format!("{}:p={}", Self::NAME, extractor.name());
        self.prefix_extractor = Some(extractor);
        self
    }

    /// Controls whether full keys are hashed into the bloom filter.
    ///
    /// When `true` (the default), point lookups can use the filter. Set
    /// to `false` if only prefix scans are needed and you want to reduce
    /// filter size.
    pub fn with_whole_key_filtering(mut self, enabled: bool) -> Self {
        self.whole_key_filtering = enabled;
        self
    }
}

impl FilterPolicy for BloomFilterPolicy {
    fn name(&self) -> &str { &self.name }

    fn builder(&self) -> Box<dyn FilterBuilder> {
        Box::new(BloomFilterBuilder::new(
            self.bits_per_key,
            self.whole_key_filtering,
            self.prefix_extractor.clone(),
        ))
    }

    fn decode(&self, data: &[u8]) -> Arc<dyn Filter> {
        Arc::new(BloomFilter::decode(
            data,
            self.whole_key_filtering,
            self.prefix_extractor.clone(),
        ))
    }

    fn estimate_size(&self, num_keys: usize) -> usize {
        BloomFilter::estimate_encoded_size(num_keys, self.bits_per_key)
    }
}
```

#### Configuration

The `filter_bits_per_key` field is removed from `Settings`. Filter policies
are configured on `DbBuilder` and `CompactorBuilder` via `with_filter_policies`,
following the same pattern as `merge_operator` and `block_transformer` which
are also trait-object-based configuration that lives on the builders rather
than the serializable `Settings` struct.

`Settings` keeps `min_filter_keys` (a plain `u32` that is serializable) since
it controls *whether* a filter is created, not *which* filter:

```rust
pub struct Settings {
    // ... existing fields ...

    /// Write SSTables with a filter if the number of keys in the SSTable
    /// is greater than or equal to this value.
    pub min_filter_keys: u32,

    // filter_bits_per_key is removed — replaced by filter_policies on
    // DbBuilder / CompactorBuilder.
}
```

`DbBuilder` and `CompactorBuilder` gain a `with_filter_policies` method:

```rust
impl> DbBuilder {
    /// Sets the filter policies for this database. Each policy produces a
    /// separate filter per SST, stored in a composite filter block. On
    /// read, all filters are evaluated with AND logic — an SST is skipped
    /// if any filter returns `false`.
    ///
    /// Defaults to `vec![Arc::new(BloomFilterPolicy::new(10))]`.
    /// Pass an empty vec to disable filters.
    pub fn with_filter_policies(
        mut self,
        policies: Vec>,
    ) -> Self {
        self.filter_policies = policies;
        self
    }
}

impl> CompactorBuilder {
    /// Sets the filter policies used when the compactor rewrites SSTs.
    /// Must match the writer's policies to avoid silently dropping filters.
    pub fn with_filter_policies(
        mut self,
        policies: Vec>,
    ) -> Self {
        self.filter_policies = policies;
        self
    }
}
```

Both `ScanOptions` and `ReadOptions` get a `filter_context` field for
passing opaque context to custom filters:

```rust
pub struct ScanOptions {
    // ... existing fields ...

    /// Opaque context passed to custom filter policies at query time.
    /// Built-in filters (including the default bloom filter) ignore this
    /// field, so leaving it `None` is always safe.
    pub filter_context: Option,
}

pub struct ReadOptions {
    // ... existing fields ...

    /// Opaque context passed to custom filter policies at query time.
    /// See `ScanOptions::filter_context`.
    pub filter_context: Option,
}
```

The two-variant shape lets callers choose their byte encoding:

- `Inline([u8; 64])` covers the common fixed-layout case (e.g., a pair of
  `u64`s for a min/max version filter) with zero heap allocation.
- `Bytes(Bytes)` covers larger or variable-length payloads.

**Configuration modes (for BloomFilterPolicy):**

| `whole_key_filtering` | `prefix_extractor` | Hashes stored       | `get`                   | `scan_prefix`            |
|-----------------------|--------------------|---------------------|-------------------------|--------------------------|
| `true` (default)      | `None`             | full-key            | full-key hash           | not filtered             |
| `true`                | `Some(...)`        | full-key + prefix   | full-key hash (tighter) | extracted prefix hash    |
| `false`               | `Some(...)`        | prefix only         | extracted prefix hash   | extracted prefix hash    |

The third row — prefix-only storage with point lookups also probing the
extracted prefix — fits workloads whose keys follow a
`GroupId ‖ Suffix` pattern, where every query (scan or point) is
answered by matching on the group id. The filter is sized against the
number of distinct group ids (small), and point lookups reuse those
same hashes instead of paying to store a full-key hash per row. If the
extracted prefix is not in the filter, the full key cannot be in the
SST either, so the probe is safe.

Usage:

```rust
// Default: full-key bloom filter (backwards compatible, no explicit call needed)
let db = Db::builder("path", object_store).build().await?;

// Full-key + prefix bloom filter (recommended for prefix scan workloads)
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![Arc::new(BloomFilterPolicy::new(10)
        .with_prefix_extractor(Arc::new(MyPrefixExtractor::new())))])
    .build()
    .await?;

// Prefix-only bloom filter. Point lookups probe the filter with the
// extracted prefix of the queried key; scan_prefix probes with the
// extracted prefix of the scan target. No full-key hashes are stored.
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![Arc::new(BloomFilterPolicy::new(10)
        .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))
        .with_whole_key_filtering(false))])
    .build()
    .await?;

// Multiple filters: bloom + custom min/max filter
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![
        Arc::new(BloomFilterPolicy::new(10)
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))),
        Arc::new(MyMinMaxFilterPolicy::new(...)),
    ])
    .build()
    .await?;

// Passing context to custom filters at scan time. For a min/max version
// filter that expects two big-endian u64s, the Inline variant gives a
// no-heap-allocation path even when bounds change per scan:
let mut payload = [0u8; 64];
payload[..8].copy_from_slice(&min_version.to_be_bytes());
payload[8..16].copy_from_slice(&max_version.to_be_bytes());
let scan_options = ScanOptions::default()
    .with_filter_context(Some(FilterContext::Inline(payload)));
```

#### Write path: building the filter

During SST construction, the builder adds both hashes to the same underlying
bloom filter. Consecutive keys sharing the same prefix only store the prefix
hash once (deduplication), keeping the overhead minimal.

#### Read path: querying the filter

The same filter is probed with different hashes depending on the query type:

```rust
impl Filter for BloomFilter {
    fn might_match(&self, query: &FilterQuery) -> bool {
        match &query.target {
            FilterTarget::Point(key) if self.whole_key_filtering => {
                // Full-key hash gives the tightest answer when available.
                self.might_contain(filter_hash(key.as_ref()))
            }
            target => {
                // Otherwise defer to the extractor. For `Point`, this is
                // the fallback when whole-key filtering is disabled —
                // we probe with the extracted prefix of the queried key.
                // For `Prefix`, the extractor answers whether the scan
                // prefix is safe to probe (returning `None` when the
                // extractor's extraction depends on bytes beyond the
                // scan prefix, e.g., a last-delimiter extractor).
                let Some(ref extractor) = self.prefix_extractor else {
                    return true;
                };
                let Some(n) = extractor.prefix_len(target) else {
                    return true;
                };
                let bytes = match target {
                    FilterTarget::Point(k) => k.as_ref(),
                    FilterTarget::Prefix(p) => p.as_ref(),
                };
                self.might_contain(filter_hash(&bytes[..n]))
            }
        }
    }
}
```

### SST Format Changes

The SST gains a `filter_format` enum field to distinguish the new composite
block format from the legacy single-bloom format:

```fbs
enum FilterFormat: byte {
    // Single raw bloom filter bytes (pre-composite format).
    Legacy,
    // Block containing one or more named filters, each prefixed with its
    // name and length.
    Composite
}

table SsTableInfo {
    // ... existing fields (filter_offset, filter_len, etc.) ...

    // Filter block format. Defaults to Legacy for backwards compatibility.
    filter_format: FilterFormat;
}
```

#### Composite filter block encoding

When `filter_format` is `Composite`, the filter block at
`filter_offset`/`filter_len` contains a list of named filters:

```
[num_filters: u16]
[name_len: u16][name: bytes][data_len: u64][data: bytes]   // filter 0
[name_len: u16][name: bytes][data_len: u64][data: bytes]   // filter 1
...
```

Each sub-filter self-identifies by its policy's `name()`. Even with a single
policy, the block uses this format (a list of one).

#### Reading logic (backwards compatibility)

When reading an SST's filter block:

- **`FilterFormat::Legacy`**: The filter block contains raw bloom filter bytes. This is the FlatBuffer default, so SSTs written by older versions are
  automatically treated as Legacy. The bytes are decoded via the configured
  `BloomFilterPolicy`. If no bloom policy is configured, the filter is skipped.
- **`FilterFormat::Composite`**: The filter block is a composite containing
  one or more named sub-filters. Each `(name, data)` entry is matched against
  the configured `filter_policies` by `name()` and decoded via
  `policy.decode(data)`. Unrecognized names are skipped.

### Write Path Integration

The SST builder receives the `filter_policies` array. For each policy, it calls
`builder()` to obtain a `FilterBuilder`. Each key is fed to all builders via
`add_entry()`. On finalization, each builder produces a filter via `build()`, and
all filters are encoded into a composite filter block. The SST info's
`filter_format` is set to `Composite`.

### Read Path Integration

The current read path uses `filter_hash()` and `BloomFilter::might_contain()`
directly. We replace this with the `FilterQuery` abstraction and AND-logic
evaluation across all filters.

**Point lookups (`get`):**

The `BloomFilterEvaluator` in `sst_iter.rs` currently hashes the key and calls
`might_contain`. With the pluggable design:

```rust
// Before:
let key_hash = filter::filter_hash(self.key.as_ref());
filter.might_contain(key_hash)

// After:
let query = FilterQuery::point(self.key.clone());
// AND logic: skip SST if ANY filter returns false
filters.iter().all(|f| f.might_match(&query))
```

**Prefix scans:**

SlateDB already provides `scan_prefix` and `scan_prefix_with_options` methods
that accept a prefix. Internally, the prefix will be wired through the reader and
iterator chain so each SST's filters can be checked before opening it, skipping
SSTs where any filter returns `false`.

When a `prefix_extractor` is configured on the `BloomFilterPolicy`, prefix
scans probe the bloom filter with `filter_hash(scan_prefix[..n])` where
`n = extractor.prefix_len(FilterTarget::Prefix(scan_prefix))`. The
extractor's `Prefix`-variant contract guarantees that this probe is safe
for every extension of the scan prefix; when the extractor cannot offer
that guarantee (e.g., a last-delimiter extractor), it returns `None` and
the filter is skipped. The default configuration (no `prefix_extractor`)
also returns `true` for prefix queries, so no filtering. Other filters
in the array may still reject the SST based on context or other
criteria.

With `whole_key_filtering = false` and a `prefix_extractor` configured,
point lookups (`get`) also benefit from the filter without any additional
storage: `might_match(Point(k))` probes with
`hash(k[..extractor.prefix_len(FilterTarget::Point(k))])`. This fits
`GroupId ‖ Suffix` key schemas: the extractor matches the group id and
hashing the suffix contributes no extra pruning power, since any point
query already knows which group it targets.

Note: prefix filtering alone is not ideal for recency access patterns where
the caller only needs a recent entry for a prefix. See
[Recency-Based Iterator](#recency-based-iterator) in Future Enhancements.

## Impact Analysis

SlateDB features and components that this RFC interacts with. Check all that apply.

### Core API and Query Semantics

- [x] Basic KV API (`get`/`put`/`delete`)
- [x] Range queries, iterators, seek semantics
- [ ] Range deletions
- [ ] Error model, API errors

### Consistency, Isolation, and Multi-Versioning

- [ ] Transactions
- [ ] Snapshots
- [ ] Sequence numbers

### Time, Retention, and Derived State

- [ ] Time to live (TTL)
- [ ] Compaction filters
- [ ] Merge operator
- [ ] Change Data Capture (CDC)

### Metadata, Coordination, and Lifecycles

- [x] Manifest format
- [ ] Checkpoints
- [ ] Clones
- [ ] Garbage collection
- [ ] Database splitting and merging
- [ ] Multi-writer

### Compaction

- [ ] Compaction state persistence
- [ ] Compaction filters
- [ ] Compaction strategies
- [ ] Distributed compaction
- [ ] Compactions format

### Storage Engine Internals

- [ ] Write-ahead log (WAL)
- [x] Block cache
- [ ] Object store cache
- [x] Indexing (bloom filters, metadata)
- [x] SST format or block format

### Ecosystem and Operations

- [ ] CLI tools
- [x] Language bindings (Go/Python/etc)
- [x] Observability (metrics/logging/tracing)

## Operations

### Performance and Cost

- **Point lookup latency**: Negligible change. The dynamic dispatch overhead of
  `dyn Filter` is a single vtable lookup per SST which shouldn't be a bottleneck
  compared to I/O costs, so the default bloom filter performs identically to today.
- **Prefix scan latency**: Significant improvement when a prefix-aware filter is
  configured. SSTs that don't contain the target prefix are skipped entirely,
  avoiding block reads.
- **Write throughput**: No meaningful change. Filter construction cost is
  comparable.
- **Space**: No change for the default bloom filter (no `prefix_extractor`).
  With a `prefix_extractor`, the filter grows slightly to accommodate prefix
  hashes, but deduplication of consecutive same-prefix hashes keeps the
  overhead minimal. Much less overhead than maintaining two separate filters.

### Observability

- **Configuration**: New `with_filter_policies` on `DbBuilder` /
  `CompactorBuilder` (programmatic only; not serializable to TOML/JSON).
  `filter_bits_per_key` has been removed from `Settings`.
- **Metrics**: Existing `sst_filter_positive_count`, `sst_filter_negative_count`,
  `sst_filter_false_positive_count` counters are preserved. Each gains a
  `kind` label with values `point` or `prefix` so point-lookup filtering and
  prefix-scan filtering can be tracked independently (mirrors RocksDB's
  separation of `BLOOM_FILTER_FULL_*` and `BLOOM_FILTER_PREFIX_*` tickers).

### Compatibility

- **On-disk format**: The filter block encoding is policy-specific but the SST
  envelope is unchanged. A new `filter_format` enum field is added to SST info.
  Old SSTs without this field default to `Legacy` (FlatBuffers absent-field
  semantics map to `0`). See [SST Format Changes](#sst-format-changes).
- **Public API**: `filter_bits_per_key` is removed from `Settings`.
  `DbBuilder` and `CompactorBuilder` gain `with_filter_policies`. See
  [Configuration](#configuration) for migration examples. `ScanOptions`
  and `ReadOptions` gain an optional `filter_context: Option`
  field; built-in filters ignore it, so existing callers need no changes.
- **Rolling upgrades**: Old readers that don't understand the `filter_format`
  field will ignore it (FlatBuffers forward compatibility). They will continue
  to decode filters as bloom filters, which is correct as long as the bloom
  filter policy is still in use.
- **Prefix extractor changes**: Changing the prefix extractor changes the policy
  name (e.g., `"_bf:p=fixed3"` →
  `"_bf:p=delim:"`). With the array design, the old
  policy can remain in `filter_policies` alongside the new one, so existing
  SSTs' filters remain usable while new SSTs are written with both. After
  compaction rewrites all SSTs, the old policy can be removed.
- **Compactor policy**: If the compactor runs in a separate process from the
  writer (e.g., distributed compaction), it must be configured with the same
  `filter_policies`. Otherwise, compacted SSTs will be written with different
  (or no) filter policies, and existing filters may be silently dropped during
  compaction.

## Testing

- **Unit tests**: Filter policy correctness (no false negatives, expected false
  positive rates) for each built-in implementation.
- **Integration tests**: Prefix scan SST skipping, mismatched filter policy
  graceful degradation, backwards compatibility with old SSTs.
- **Performance tests**: Benchmark prefix scans with and without prefix bloom
  filters. The value should be very clear for prefix bloom filter results.

## Rollout

Implementation will be in two phases:

1. **Pluggable filter abstraction**: Trait definitions (`FilterPolicy`,
   `FilterBuilder`, `Filter`, `PrefixExtractor`), refactor the existing bloom
   filter as the default `BloomFilterPolicy`, SST format changes
   (`filter_format` enum, composite filter block), and refactoring the write/read
   paths to use the pluggable filter policies with AND-logic evaluation.
   Breaking config change: `filter_bits_per_key` removed from `Settings`;
   `with_filter_policies` added to `DbBuilder` / `CompactorBuilder`.
2. **Prefix bloom filter**: Add `prefix_extractor` and `whole_key_filtering`
   to `BloomFilterPolicy`. Wire the prefix through the read path so each
   SST's filters can be probed with `FilterQuery::Prefix` before opening.
   Skip SSTs where any filter rejects the prefix.

## Future Enhancements

### Recency-Based Iterator

Prefix bloom filters skip SSTs that don't contain matching keys, but the
merge-based scan still reads from every *matching* SST and merges them to
produce global sort order. For workloads where the caller only needs the most
recent entry for a prefix (e.g., latest version of a versioned key, most
recent reading from a sensor), this is wasteful — the answer is almost always
in the newest SST.

A recency-based iterator would return entries newest-first (most recent SST
first, then next most recent) without merging, similar to how `get()` walks
sources in recency order and returns the first match. Combined with prefix
bloom filters, the iterator would skip SSTs with no matching prefix, read
matching SSTs in recency order, and stop early once the caller has enough
results.

### Block-Level Filter Granularity

This RFC's filters operate at SST granularity; the reader either skips an
entire SST or reads it. This is appropriate for bloom filters (which need many
keys for good false-positive rates) but too coarse for metadata-style filters.

For example, a min/max version filter on an SST whose blocks span versions
`[1,50]`, `[51,100]`, and `[101,150]` can only report the SST-level range
`[1,150]`. A query for version ≤ 110 must read the entire SST even though only
the first block is relevant. With per-block metadata, the remaining blocks
would be skipped without loading them from storage.

The `FilterPolicy`, `FilterBuilder`, and `Filter` traits defined in this RFC
are sufficient to support block-level granularity. The core change is
configuration: each policy would declare whether it applies at SST level,
block level, or both. On the write path, block-level builders are finalized
and reset per block in `finish_block()` instead of once at SST completion.
On the read path, per-block filter data stored in `BlockMeta` is checked
before loading a data block, using the same `might_match` interface.

### Key Selection via `KeySelector`

The current `BloomFilterPolicy` applies to all keys in an SST. Some workloads
need different filter strategies for different key subsets — e.g., a full-key
bloom for keys starting with `user::` and a prefix bloom for keys starting
with `post::`. With the current design, this requires implementing a custom `FilterPolicy` that
hard-codes the selection logic in `add_entry` and `might_match`.

A `KeySelector` trait would make this composable without custom policies:

```rust
pub trait KeySelector: Send + Sync {
    fn name(&self) -> &str;
    fn includes(&self, key: &[u8]) -> bool;
}
```

`BloomFilterPolicy` would gain an optional `with_key_selector(Arc<dyn KeySelector>)`
method. On the write path, `KeySelector::includes` gates entry inclusion — if
`false`, the entry is skipped entirely (no full-key hash, no prefix hash). On
the read path for Point queries, excluded keys return `true` (inapplicable).

Prefix scan still relies on the configured `PrefixExtractor`, which
operates on the extracted prefix of each selected key.

Example configuration:

```rust
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![
        // Full-key bloom, only for "user::" keys
        Arc::new(BloomFilterPolicy::new(10)
            .with_key_selector(Arc::new(StartsWithSelector::new("user::")))),
        // Prefix bloom, only for "post::" keys
        Arc::new(BloomFilterPolicy::new(10)
            .with_key_selector(Arc::new(StartsWithSelector::new("post::")))
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))
            .with_whole_key_filtering(false)),
    ])
    .build().await?;
```

`KeySelector` would be `BloomFilterPolicy`-specific, not part of the core
`FilterPolicy`/`FilterBuilder`/`Filter` traits. Custom filter policies can
implement their own selection logic directly in `add_entry`/`might_match`.

## Alternatives

**1. Separate filter structures per SST:**
Maintain two separate bloom filters per SST: one for full keys and one for
prefixes then route queries to the appropriate filter. This is conceptually
clean but wastes space: two separate bit arrays have higher overhead than a
single shared one. Rejected in favor of the single-filter approach.

**2. Ship DIVA or SuRF as a built-in filter:**
The optimal filter design is workload-dependent. Rather than picking one, the
pluggable trait lets users implement their own policy. Rejected as a built-in;
users can integrate these as plugins.

**3. Status quo (no prefix filtering):**
Prefix scans continue to read all overlapping SSTs. This is viable for small
databases but becomes a bottleneck for large datasets with many sorted runs,
as described in issues #1334 and #1302. Rejected because the performance
impact is significant and the fix is well-understood.

## Open Questions

### Should `scan` / `scan_with_options` also support prefix filtering?

**Resolved: Option A — No inference.** Prefix filtering only via `scan_prefix` /
`scan_prefix_with_options`. `scan` / `scan_with_options` never use prefix
filters. This keeps behavior explicit and avoids ambiguity about when filtering
is active.

### Should `filter_policies` be an array or a single policy?

**Resolved: Array.** Both reviewers approved the array approach. The array
enables migration (old policy stays alongside the new one), multi-filter
composition (e.g., bloom + min/max), and avoids a future breaking API change.
A "compound filter" that looks like a single filter to SlateDB but internally
contains multiple sub-filters (e.g., `filter_2_start_idx, filter_1_bytes,
filter_2_bytes`) was also discussed as a complementary approach for cases like
different filter strategies per key subset — to be explored further in code.

### Should range scan filtering be added?

This RFC only adds prefix-based SST filtering. Range scan filtering (skipping
SSTs based on arbitrary range bounds) is not included but can be added with the
current design — a `FilterQuery::Range` variant and a range-aware filter policy
would slot into the existing `might_match` dispatch without structural changes.
The read path would need to pass the full scan range bounds to the filter
instead of just a prefix.

## References

- [Issue #1334: Improve prefix scan performance with filtering](https://github.com/slatedb/slatedb/issues/1334)
- [Issue #1302: High scan operation latency](https://github.com/slatedb/slatedb/issues/1302)
- [LevelDB FilterPolicy](https://github.com/google/leveldb/blob/main/include/leveldb/filter_policy.h)
- [RocksDB Bloom Filter wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter)
- [RocksDB Prefix Seek](https://github.com/facebook/rocksdb/wiki/Prefix-Seek)
- [DIVA: Dynamic Range Filter for Var-Length Keys (VLDB 2025)](https://www.vldb.org/pvldb/vol18/p3923-eslami.pdf)
- [SuRF: Succinct Range Filters (SIGMOD 2018)](https://dl.acm.org/doi/fullHtml/10.1145/3375660)

## Updates

- **2026-04-21** — Reworked `PrefixExtractor` to collapse `in_domain`
  and `prefix_len` into a single method
  `prefix_len(&FilterTarget) -> Option<usize>`. The `FilterTarget`
  argument lets the extractor distinguish a complete key (`Point`) from
  a scan prefix (`Prefix`), so extractors that inspect bytes beyond the
  scan prefix (e.g., last-delimiter) can return `None` for `Prefix`
  while still extracting for `Point`. As a result, `might_match` now
  filters point lookups via the extracted prefix when
  `whole_key_filtering = false`, and prefix scans may truncate
  over-length scan prefixes and probe.