Prefix Bloom Filters via Pluggable Filter Policies
Status: Accepted
Authors:
Summary
This RFC proposes adding prefix bloom filters to SlateDB, enabling prefix scans to skip SSTs that don’t contain matching keys. To support this cleanly, the hard-coded bloom filter is replaced with a pluggable filter policy abstraction where users can supply their own filter implementation (e.g., cuckoo filters, range filters) through traits. The existing bloom filter becomes the default implementation behind this trait, preserving full backwards compatibility.
Motivation
SlateDB currently uses a bloom filter that hashes full keys. This is effective for
point lookups (get) but provides no benefit for prefix scans, which are a
critical access pattern for many applications (see issue
#1334 and
#1302).
Today, a prefix scan must check all sorted runs and any SSTs whose key range could overlap, even though many of those SSTs contain no keys with the target prefix. A prefix-aware filter could eliminate these unnecessary SST reads.
There are many filter designs that could help here — each with different trade-offs. Rather than committing to one, we propose a pluggable filter policy that lets SlateDB ship with a default (the existing bloom filter) and an optional built-in prefix bloom filter, while enabling users to plug in specialized filters without modifying the engine.
Goals
- Define traits that abstract filter construction, serialization, and querying.
- Refactor the existing bloom filter as the default implementation.
- Support prefix bloom filters by allowing a user-defined key transformation before hashing.
- Support future filter implementations without engine changes.
- Support multiple concurrent filter policies per SST with AND logic evaluation.
- Enable safe migration from one filter policy to another without losing filtering on existing SSTs.
- Maintain full backwards compatibility: existing databases with bloom filters continue to work.
- Support both point-lookup filtering and prefix-scan filtering.
Non-Goals
- Shipping additional filter implementations beyond the existing bloom filter and a prefix bloom filter.
Design
Traits
The core abstraction is a `FilterPolicy` trait with associated builder and filter types.
```rust
/// A named, configurable filter policy.
pub trait FilterPolicy {
    /// An identifier for this policy. Stored per-SST so the reader knows
    /// whether it can decode and use the filter block.
    ///
    /// The name should encode anything that affects **compatibility** —
    /// i.e., anything that would make a filter unreadable or produce wrong
    /// results if mismatched. For the built-in bloom filter:
    /// - `bits_per_key` is not included — changing it affects quality (FP
    ///   rate) but not the encoding format, so any reader can decode any
    ///   bloom filter regardless of bits_per_key.
    /// - `prefix_extractor` name is included — it changes which hashes are
    ///   stored, so querying with a different extractor produces false
    ///   negatives.
    ///
    /// Examples: `"slatedb.BloomFilter"`,
    /// `"slatedb.BloomFilter:prefix=fixed3"`.
    fn name(&self) -> &str;

    /// Creates a new builder for constructing a filter.
    fn builder(&self) -> Box<dyn FilterBuilder>;

    /// Decodes a previously encoded filter.
    ///
    /// The engine matches the policy name stored in the composite filter
    /// block against `self.name()` before calling this to ensure the policy
    /// can deserialize the data.
    fn decode(&self, data: &[u8]) -> Arc<dyn Filter>;

    /// Estimates the encoded size in bytes for a filter with `num_keys` keys.
    ///
    /// This is a hint used by the SST builder to reserve buffer space before
    /// the filter is built. It does not need to be exact.
    ///
    /// For the built-in bloom filter, the actual filter is sized from the
    /// real number of hashes collected during `add_entry`, not from this
    /// estimate, so overestimates waste a small allocation and underestimates
    /// just trigger a reallocation.
    fn estimate_size(&self, num_keys: usize) -> usize;
}

/// Accumulator for entries during SST construction that produces a [Filter].
pub trait FilterBuilder {
    /// Feeds an SST entry to the filter being built.
    ///
    /// The builder receives the full `RowEntry` (key, value, sequence
    /// number, timestamps) so that filter implementations are not limited
    /// to key-only hashing. A bloom filter will typically hash only the
    /// key (or a prefix of it), but other filters — for example a min/max
    /// timestamp filter or a sequence-number range filter — can inspect
    /// whichever fields they need.
    fn add_entry(&mut self, entry: &RowEntry);

    /// Finalizes and returns the completed filter.
    fn build(&self) -> Arc<dyn Filter>;
}

/// A read-only filter that answers membership queries.
pub trait Filter {
    /// Returns `true` if the filter cannot rule out the query.
    /// A return value of `false` guarantees no matching key exists.
    ///
    /// Filters that cannot answer a particular query kind should return
    /// `true` to avoid false negatives.
    fn might_match(&self, query: &FilterQuery) -> bool;

    /// Serializes the filter into the provided buffer.
    fn encode(&self, writer: &mut dyn BufMut);

    /// Returns the size of the filter's data in bytes.
    ///
    /// This should reflect the underlying data structure size (e.g., the
    /// bit array length for a bloom filter). Can be used for memory
    /// tracking, cache accounting, etc.
    fn size(&self) -> usize;
}

/// A membership query passed to [`Filter::might_match`].
pub struct FilterQuery {
    /// The target of this query (a specific key or a prefix).
    pub target: FilterTarget,

    /// Opaque, caller-supplied context for custom filters.
    ///
    /// Custom filters downcast this to their expected type. For example,
    /// a min/max version filter expects `Arc<VersionBounds>` and ignores
    /// the query if the downcast fails. The built-in bloom filter ignores
    /// this field entirely.
    pub context: Option<Arc<dyn Any + Send + Sync>>,
}

/// The target of a filter query.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FilterTarget {
    /// Used to test whether a specific key might exist in the SST.
    Point(Bytes),
    /// Used to test whether any key with the given prefix might exist in the SST.
    Prefix(Bytes),
}
```

Key design decisions:
- **`FilterQuery` with opaque context**: `FilterQuery` is a struct rather than a plain enum so it can carry caller-supplied context alongside the query kind. The `context` field is an `Option<Arc<dyn Any + Send + Sync>>` cloned from `ScanOptions::filter_context`. Custom filters downcast it to their expected type — e.g., a min/max version filter downcasts to `VersionBounds`. If the downcast fails or context is `None`, the filter returns `true` (cannot answer). The `Arc` clone is cheap (a reference count bump) and the object can be reused across concurrent scans without per-call allocation. The built-in bloom filter ignores context entirely.
- **Filter applicability is expressed through `might_match` returning `true`**: Rather than adding a separate `applies_to` or `can_answer` method, filters express inapplicability by returning `true` from `might_match` — meaning “I cannot rule this out.” This keeps the trait surface minimal and works uniformly for all filter types:
  - Build time: `FilterBuilder::add_entry` receives the full `RowEntry`. Builders that don’t care about certain entries simply ignore them (e.g., a timestamp-range filter ignores entries without timestamps, a prefix bloom filter ignores entries whose key doesn’t match the extractor).
  - Read time (`get`): `might_match` receives `FilterTarget::Point` with the full key. Filters that can’t answer point queries return `true`.
  - Read time (`scan_prefix`): `might_match` receives `FilterTarget::Prefix` with only the prefix. Filters that can’t answer prefix queries return `true`. For bloom filters with a `PrefixExtractor`, the `in_domain` check is performed inside `might_match` — if the prefix is not in domain, the filter returns `true` rather than risk a false negative.

  This means the engine never needs to know which filters apply to which queries. It always evaluates all filters with AND logic and trusts each filter to return `true` when it cannot answer.
- **`name()` for safety**: Each policy’s name is stored alongside its filter data in the composite filter block. When reading an SST, if a stored name doesn’t match any configured policy, that sub-filter is skipped rather than misinterpreted. This allows safe policy migration without manifest-level validation (the same pattern as LevelDB’s `FilterPolicy::Name()`).
- **`FilterTarget` enum**: Instead of just accepting a hash, the `might_match` method takes a `FilterQuery` whose `target` field distinguishes point lookups and prefix lookups. Filters that only support point lookups can conservatively return `true` for prefix queries.
- **`encode` on `Filter`, `decode` on `FilterPolicy`**: Encoding is on the `Filter` instance because it knows its own internal representation. Decoding is on the `FilterPolicy` because the caller needs the policy (which knows the format) to reconstruct a `Filter` from raw bytes.
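To make the context-downcast pattern concrete, here is a minimal self-contained sketch. It is not SlateDB code: `VersionBounds`, the simplified `FilterQuery` (the `target` field is elided), and the stored min/max fields are all illustrative assumptions.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical context type a caller would pass via ScanOptions::filter_context.
pub struct VersionBounds {
    pub min_version: u64,
    pub max_version: u64,
}

// Simplified stand-in for the RFC's FilterQuery (target elided for brevity).
pub struct FilterQuery {
    pub context: Option<Arc<dyn Any + Send + Sync>>,
}

// A custom filter that stores the min/max version observed in one SST.
pub struct MinMaxVersionFilter {
    sst_min: u64,
    sst_max: u64,
}

impl MinMaxVersionFilter {
    pub fn new(sst_min: u64, sst_max: u64) -> Self {
        Self { sst_min, sst_max }
    }

    /// Returns false only when the SST's version range is provably
    /// disjoint from the caller's requested bounds.
    pub fn might_match(&self, query: &FilterQuery) -> bool {
        // No context, or a context of an unexpected type: cannot answer,
        // so conservatively return true (never a false negative).
        let Some(ctx) = &query.context else { return true };
        let Some(bounds) = ctx.downcast_ref::<VersionBounds>() else {
            return true;
        };
        // Overlap check: reject the SST only if the ranges are disjoint.
        bounds.min_version <= self.sst_max && bounds.max_version >= self.sst_min
    }
}
```

Because inapplicability is just `true`, the same filter composes safely with any other policies under AND evaluation.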
Built-in Implementation: BloomFilterPolicy
A single `BloomFilterPolicy` handles both full-key and prefix filtering using
one bloom filter per SST. Both full-key hashes and prefix hashes are added to
the same bit array.
PrefixExtractor
`PrefixExtractor` is specific to `BloomFilterPolicy` — it is not part of the
core FilterPolicy/FilterBuilder/Filter traits. Custom filter policies
that need prefix-based filtering can implement their own selection logic
directly in add_entry/might_match.
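For illustration, a minimal fixed-length extractor satisfying this contract might look like the sketch below. `FixedPrefixExtractor` is hypothetical, and the trait is restated in abbreviated form (doc comments elided) so the snippet is self-contained; the full trait definition follows.

```rust
/// Abbreviated copy of the RFC's PrefixExtractor trait so this sketch
/// compiles standalone.
pub trait PrefixExtractor {
    fn name(&self) -> &str;
    fn in_domain(&self, prefix: &[u8]) -> bool;
    fn extract<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;
}

/// Hypothetical extractor: the prefix is the first `len` bytes of the key.
pub struct FixedPrefixExtractor {
    len: usize,
    name: String,
}

impl FixedPrefixExtractor {
    pub fn new(len: usize) -> Self {
        Self { len, name: format!("fixed{}", len) }
    }
}

impl PrefixExtractor for FixedPrefixExtractor {
    fn name(&self) -> &str {
        &self.name
    }

    // A scan prefix is usable only if it is exactly one extracted prefix;
    // shorter prefixes were never hashed into the filter.
    fn in_domain(&self, prefix: &[u8]) -> bool {
        prefix.len() == self.len
    }

    // Keys shorter than `len` carry no recognizable prefix.
    fn extract<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        key.get(..self.len)
    }
}
```

With `len = 3`, `in_domain(b"ab")` is `false`, which is exactly the guard discussed in the `in_domain` documentation below.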
```rust
/// Extractor for a prefix from a key for use in prefix-based bloom filtering.
///
/// Used on the write path to hash prefixes into the bloom filter during SST
/// construction.
pub trait PrefixExtractor {
    /// A unique name identifying this extractor's configuration.
    ///
    /// Changing the extractor (e.g. switching from a 4-byte fixed prefix
    /// to a delimiter-based one) changes which hashes are stored in the
    /// filter, so existing filters become invalid. `BloomFilterPolicy`
    /// includes this name in the policy name it writes to SST metadata
    /// (e.g. `"slatedb.BloomFilter:prefix=fixed3"`), which lets the reader
    /// detect the mismatch and skip the filter instead of returning wrong
    /// results.
    fn name(&self) -> &str;

    /// Returns whether the given prefix is a valid output of `extract()`.
    ///
    /// This is used on the read path to verify that a scan prefix provided
    /// by the user matches the prefix format indexed in the filter. If this
    /// returns `false`, the filter must NOT be consulted; doing so can
    /// produce false negatives (the filter says "not present" for data that
    /// actually exists).
    ///
    /// **Example of incorrect behavior without this check:**
    /// Assume a prefix extractor that extracts the first 3 bytes of a
    /// key and an SST contains keys `abc_1`, `abc_2`, `abx_1`.
    /// During SST construction, the filter indexes the extracted prefixes:
    /// `abc` and `abx`.
    /// At read time, a user calls `scan_prefix("ab")`. The 2-byte prefix
    /// `"ab"` was never inserted into the filter, so `might_match("ab")`
    /// returns `false`. The SST is skipped even though all three keys match
    /// the scan prefix `"ab"`. This is a false negative.
    ///
    /// With `in_domain`: `in_domain("ab")` returns `false` (`"ab"` is not
    /// a valid 3-byte prefix), so the engine skips the filter check and
    /// falls back to scanning the SST directly.
    fn in_domain(&self, prefix: &[u8]) -> bool;

    /// Extracts the prefix from a key. Returns `None` if the key does not
    /// contain a recognizable prefix (i.e., `in_domain` would return `false`
    /// for the key).
    fn extract<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]>;
}
```

The policy would look like:
```rust
#[derive(Debug)]
pub struct BloomFilterPolicy {
    bits_per_key: u32,

    /// When true, full-key hashes are added to the filter. Point lookups
    /// (`get`) probe the filter with the full-key hash.
    /// Default: true.
    whole_key_filtering: bool,

    /// When set, prefix hashes are also added to the filter. Prefix scans
    /// probe the filter with the prefix hash.
    /// Default: None (no prefix filtering).
    prefix_extractor: Option<Arc<dyn PrefixExtractor>>,

    name: String,
}

impl BloomFilterPolicy {
    pub fn new(bits_per_key: u32) -> Self {
        Self {
            bits_per_key,
            whole_key_filtering: true,
            prefix_extractor: None,
            name: "slatedb.BloomFilter".to_string(),
        }
    }

    pub fn with_prefix_extractor(mut self, extractor: Arc<dyn PrefixExtractor>) -> Self {
        self.name = format!("slatedb.BloomFilter:prefix={}", extractor.name());
        self.prefix_extractor = Some(extractor);
        self
    }

    pub fn with_whole_key_filtering(mut self, enabled: bool) -> Self {
        self.whole_key_filtering = enabled;
        self
    }
}

impl FilterPolicy for BloomFilterPolicy {
    fn name(&self) -> &str {
        &self.name
    }

    fn builder(&self) -> Box<dyn FilterBuilder> {
        Box::new(BloomFilterBuilder::new(
            self.bits_per_key,
            self.whole_key_filtering,
            self.prefix_extractor.clone(),
        ))
    }

    fn decode(&self, data: &[u8]) -> Arc<dyn Filter> {
        Arc::new(BloomFilter::decode(
            data,
            self.whole_key_filtering,
            self.prefix_extractor.is_some(),
        ))
    }

    fn estimate_size(&self, num_keys: usize) -> usize {
        BloomFilter::estimate_encoded_size(num_keys, self.bits_per_key)
    }
}
```

Configuration
The `filter_bits_per_key` field is removed from `Settings`. Filter policies
are configured on `DbBuilder` and `CompactorBuilder` via `with_filter_policies`, following the same pattern as `merge_operator` and `block_transformer`, which are also trait-object-based configuration that lives on the builders rather than the serializable `Settings` struct.
`Settings` keeps `min_filter_keys` (a plain `u32` that is serializable) since it controls whether a filter is created, not which filter:
```rust
pub struct Settings {
    // ... existing fields ...

    /// Write SSTables with a filter if the number of keys in the SSTable
    /// is greater than or equal to this value.
    pub min_filter_keys: u32,

    // filter_bits_per_key is removed — replaced by filter_policies on
    // DbBuilder / CompactorBuilder.
}
```

`DbBuilder` and `CompactorBuilder` gain a `with_filter_policies` method:
```rust
impl<P: Into<Path>> DbBuilder<P> {
    /// Sets the filter policies for this database. Each policy produces a
    /// separate filter per SST, stored in a composite filter block. On
    /// read, all filters are evaluated with AND logic — an SST is skipped
    /// if any filter returns `false`.
    ///
    /// Defaults to `vec![Arc::new(BloomFilterPolicy::new(10))]`.
    /// Pass an empty vec to disable filters.
    pub fn with_filter_policies(
        mut self,
        policies: Vec<Arc<dyn FilterPolicy>>,
    ) -> Self {
        self.filter_policies = policies;
        self
    }
}

impl<P: Into<Path>> CompactorBuilder<P> {
    /// Sets the filter policies used when the compactor rewrites SSTs.
    /// Must match the writer's policies to avoid silently dropping filters.
    pub fn with_filter_policies(
        mut self,
        policies: Vec<Arc<dyn FilterPolicy>>,
    ) -> Self {
        self.filter_policies = policies;
        self
    }
}
```

`ScanOptions` gets a `filter_context` field for passing opaque context to custom filters:
```rust
pub struct ScanOptions {
    // ... existing fields ...

    /// Opaque context passed to custom filters at query time.
    ///
    /// Custom filters downcast this to their expected type inside
    /// `might_match`. For example, a min/max version filter expects
    /// `Arc<VersionBounds>`.
    ///
    /// `None` means no context is provided; filters that need context
    /// return `true` (cannot rule out the SST).
    pub filter_context: Option<Arc<dyn Any + Send + Sync>>,
}
```

Configuration modes (for `BloomFilterPolicy`):
| `whole_key_filtering` | `prefix_extractor` | Behavior |
|---|---|---|
| `true` (default) | `None` | Full-key bloom only; today’s default, backwards compatible |
| `true` | `Some(...)` | Both point lookups and prefix scans are filtered |
| `false` | `Some(...)` | Prefix-only — smaller filter, no point-lookup filtering |
Usage:
```rust
// Default: full-key bloom filter (backwards compatible, no explicit call needed)
let db = Db::builder("path", object_store).build().await?;

// Full-key + prefix bloom filter (recommended for prefix scan workloads)
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![Arc::new(
        BloomFilterPolicy::new(10)
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new())),
    )])
    .build()
    .await?;

// Prefix-only bloom filter (no point-lookup filtering)
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![Arc::new(
        BloomFilterPolicy::new(10)
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))
            .with_whole_key_filtering(false),
    )])
    .build()
    .await?;

// Multiple filters: bloom + custom min/max filter
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![
        Arc::new(BloomFilterPolicy::new(10)
            .with_prefix_extractor(Arc::new(MyPrefixExtractor::new()))),
        Arc::new(MyMinMaxFilterPolicy::new(...)),
    ])
    .build()
    .await?;

// Passing context to custom filters at scan time.
// The Arc can be created once and reused across scans.
let version_bounds: Arc<dyn Any + Send + Sync> = Arc::new(VersionBounds {
    min_version: 0,
    max_version: 42,
});
let scan_options = ScanOptions {
    filter_context: Some(version_bounds),
    ..ScanOptions::default()
};
```

Write path: building the filter
During SST construction, the builder adds both hashes to the same underlying bloom filter. Consecutive keys sharing the same prefix only store the prefix hash once (deduplication), keeping the overhead minimal.
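A sketch of that deduplication logic follows. These are hypothetical builder internals, not SlateDB's actual `BloomFilterBuilder`: the extractor is inlined as a fixed 3-byte prefix and `filter_hash` is a stand-in. Because SST keys arrive in sorted order, equal prefixes are adjacent, so remembering only the last extracted prefix suffices:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for SlateDB's internal filter_hash (any 64-bit hash suffices here).
fn filter_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Hypothetical builder state illustrating prefix-hash deduplication.
/// `hashes` accumulates the values later used to size and populate the
/// bloom filter's bit array.
pub struct BloomFilterBuilder {
    whole_key_filtering: bool,
    last_prefix: Option<Vec<u8>>,
    hashes: Vec<u64>,
}

impl BloomFilterBuilder {
    pub fn new(whole_key_filtering: bool) -> Self {
        Self { whole_key_filtering, last_prefix: None, hashes: Vec::new() }
    }

    /// Extraction is inlined as "first 3 bytes" to keep the sketch short.
    fn extract<'a>(&self, key: &'a [u8]) -> Option<&'a [u8]> {
        key.get(..3)
    }

    pub fn add_key(&mut self, key: &[u8]) {
        if self.whole_key_filtering {
            self.hashes.push(filter_hash(key));
        }
        if let Some(prefix) = self.extract(key) {
            // Keys are fed in sorted order, so equal prefixes are adjacent:
            // comparing against the previous prefix dedups all repeats.
            if self.last_prefix.as_deref() != Some(prefix) {
                self.hashes.push(filter_hash(prefix));
                self.last_prefix = Some(prefix.to_vec());
            }
        }
    }

    pub fn num_hashes(&self) -> usize {
        self.hashes.len()
    }
}
```

For the keys `abc_1`, `abc_2`, `abx_1` this stores three full-key hashes but only two prefix hashes (`abc` once, `abx` once).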
Read path: querying the filter
The same filter is probed with different hashes depending on the query type:
```rust
impl Filter for BloomFilter {
    fn might_match(&self, query: &FilterQuery) -> bool {
        match &query.target {
            FilterTarget::Point(key) => {
                if !self.whole_key_filtering {
                    return true; // Cannot answer point queries
                }
                self.might_contain(filter_hash(key.as_ref()))
            }
            FilterTarget::Prefix(prefix) => {
                // No PrefixExtractor → cannot answer prefix queries
                let Some(ref extractor) = self.prefix_extractor else {
                    return true;
                };
                // in_domain check prevents false negatives when the prefix
                // is shorter/incompatible with what was indexed
                if !extractor.in_domain(prefix.as_ref()) {
                    return true;
                }
                self.might_contain(filter_hash(prefix.as_ref()))
            }
        }
    }
}
```

SST Format Changes
The SST gains a `filter_version` field to distinguish the new composite block
format from the legacy single-bloom format:

```
table SsTableInfo {
    // ... existing fields (filter_offset, filter_len, etc.) ...

    // Version of the filter block format.
    // Absent or 1: legacy single bloom filter (raw bytes).
    // 2: composite block list of named filters.
    filter_version: uint16;
}
```

Composite filter block encoding
When `filter_version` is 2, the filter block at
`filter_offset`/`filter_len` contains a self-describing list of named filters:

```
[num_filters: u32]
[name_len: u32][name: bytes][data_len: u32][data: bytes]  // filter 0
[name_len: u32][name: bytes][data_len: u32][data: bytes]  // filter 1
...
```

Each sub-filter self-identifies by its policy’s `name()`. Even with a single policy, the block uses this format (a list of one).
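The framing is simple length-prefixed records, so an encoder/decoder pair is straightforward. The sketch below assumes little-endian `u32` fields, which this RFC does not actually specify:

```rust
// Encodes (name, data) pairs using the composite layout:
// [num_filters: u32] then, per filter, [name_len][name][data_len][data].
// Little-endian encoding is an assumption for illustration.
pub fn encode_composite(filters: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(filters.len() as u32).to_le_bytes());
    for (name, data) in filters {
        out.extend_from_slice(&(name.len() as u32).to_le_bytes());
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(&(data.len() as u32).to_le_bytes());
        out.extend_from_slice(data);
    }
    out
}

// Parses the composite block back into (name, data) pairs; returns None on
// any truncated or malformed input.
pub fn decode_composite(mut buf: &[u8]) -> Option<Vec<(String, Vec<u8>)>> {
    fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
        let (head, tail) = (buf.get(..n)?, buf.get(n..)?);
        *buf = tail;
        Some(head)
    }
    fn take_u32(buf: &mut &[u8]) -> Option<u32> {
        Some(u32::from_le_bytes(take(buf, 4)?.try_into().ok()?))
    }
    let num = take_u32(&mut buf)?;
    let mut filters = Vec::with_capacity(num as usize);
    for _ in 0..num {
        let name_len = take_u32(&mut buf)? as usize;
        let name = String::from_utf8(take(&mut buf, name_len)?.to_vec()).ok()?;
        let data_len = take_u32(&mut buf)? as usize;
        let data = take(&mut buf, data_len)?.to_vec();
        filters.push((name, data));
    }
    Some(filters)
}
```

Length-prefixing keeps the block self-describing: a reader can skip a sub-filter whose name it does not recognize without knowing anything about its payload.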
Reading logic (backwards compatibility)
When reading an SST’s filter block:
- `filter_version` absent or `1`: Legacy pre-RFC SST. The raw bytes are decoded using the built-in `BloomFilterPolicy`. If no matching bloom policy is configured, the filter is skipped.
- `filter_version` = `2`: Parse the composite block. For each `(name, data)` entry, find a matching policy in the configured `filter_policies` by `name()` and call `policy.decode(data)`. Unknown names are skipped (this reduces filtering power but never produces incorrect results).
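The name-matching step for version-2 blocks can be sketched as follows, using minimal stand-in traits so the snippet compiles on its own (these are not the actual SlateDB types):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Minimal stand-ins for the RFC's traits.
pub trait Filter {
    fn might_match(&self, key: &[u8]) -> bool;
}
pub trait FilterPolicy {
    fn name(&self) -> &str;
    fn decode(&self, data: &[u8]) -> Arc<dyn Filter>;
}

/// Matches each stored (name, data) pair against the configured policies.
/// Unknown names are skipped: that reduces filtering power, never correctness.
pub fn decode_filters(
    entries: &[(String, Vec<u8>)],
    policies: &[Arc<dyn FilterPolicy>],
) -> Vec<Arc<dyn Filter>> {
    let by_name: HashMap<&str, &Arc<dyn FilterPolicy>> =
        policies.iter().map(|p| (p.name(), p)).collect();
    entries
        .iter()
        .filter_map(|(name, data)| by_name.get(name.as_str()).map(|p| p.decode(data)))
        .collect()
}
```

The resulting vector is what the read path evaluates with AND logic before opening the SST.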
Write Path Integration
The SST builder receives the `filter_policies` array. For each policy, it calls
`builder()` to obtain a `FilterBuilder`. Each key is fed to all builders via `add_entry()`. On finalization, each builder produces a filter via `build()`, and all filters are encoded into a composite filter block. The SST info’s `filter_version` is set to 2.
Read Path Integration
The current read path uses `filter_hash()` and `BloomFilter::might_contain()`
directly. We replace this with the FilterQuery abstraction and AND-logic
evaluation across all filters.
Point lookups (`get`):
The `BloomFilterEvaluator` in `sst_iter.rs` currently hashes the key and calls `might_contain`. With the pluggable design:

```rust
// Before:
let key_hash = filter::filter_hash(self.key.as_ref());
filter.might_contain(key_hash)

// After:
let query = FilterQuery::point(self.key.clone());
// AND logic: skip SST if ANY filter returns false
filters.iter().all(|f| f.might_match(&query))
```

Prefix scans:
SlateDB already provides `scan_prefix` and `scan_prefix_with_options` methods that accept a prefix. Internally, the prefix will be wired through the reader and iterator chain so each SST’s filters can be checked before opening it, skipping SSTs where any filter returns `false`.
When a `prefix_extractor` is configured on the `BloomFilterPolicy`, prefix scans probe the bloom filter with `filter_hash(prefix)`. The default configuration (no `prefix_extractor`) returns `true` for prefix queries, so no filtering occurs. This is safe: point lookups still use the full-key hash. Other filters in the array may still reject the SST based on context or other criteria.
Note: prefix filtering alone is not ideal for recency access patterns where the caller only needs a recent entry for a prefix. See Recency-Based Iterator in Future Enhancements.
Impact Analysis
SlateDB features and components that this RFC interacts with. Check all that apply.
Core API and Query Semantics
- Basic KV API (`get`/`put`/`delete`)
- Range queries, iterators, seek semantics
- Range deletions
- Error model, API errors
Consistency, Isolation, and Multi-Versioning
- Transactions
- Snapshots
- Sequence numbers
Time, Retention, and Derived State
- Time to live (TTL)
- Compaction filters
- Merge operator
- Change Data Capture (CDC)
Metadata, Coordination, and Lifecycles
- Manifest format
- Checkpoints
- Clones
- Garbage collection
- Database splitting and merging
- Multi-writer
Compaction
- Compaction state persistence
- Compaction filters
- Compaction strategies
- Distributed compaction
- Compactions format
Storage Engine Internals
- Write-ahead log (WAL)
- Block cache
- Object store cache
- Indexing (bloom filters, metadata)
- SST format or block format
Ecosystem and Operations
- CLI tools
- Language bindings (Go/Python/etc)
- Observability (metrics/logging/tracing)
Operations
Performance and Cost
- Point lookup latency: Negligible change. The dynamic dispatch overhead of
`dyn Filter` is a single vtable lookup per SST, which shouldn’t be a bottleneck compared to I/O costs, so the default bloom filter performs identically to today.
- Prefix scan latency: Significant improvement when a prefix-aware filter is configured. SSTs that don’t contain the target prefix are skipped entirely, avoiding block reads.
- Write throughput: No meaningful change. Filter construction cost is comparable.
- Space: No change for the default bloom filter (no `prefix_extractor`). With a `prefix_extractor`, the filter grows slightly to accommodate prefix hashes, but deduplication of consecutive same-prefix hashes keeps the overhead minimal. This is much less overhead than maintaining two separate filters.
Observability
- Configuration: New
`with_filter_policies` on `DbBuilder`/`CompactorBuilder` (programmatic only; not serializable to TOML/JSON). `filter_bits_per_key` has been removed from `Settings`.
- Metrics: Existing `sst_filter_positives`, `sst_filter_negatives`, and `sst_filter_false_positives` metrics are preserved.
Compatibility
- On-disk format: The filter block encoding is policy-specific but the SST
envelope is unchanged. A new `filter_version` field is added to SST info. Old SSTs without this field are assumed to use the legacy single-bloom format (see SST Format Changes).
- Public API: `filter_bits_per_key` is removed from `Settings`. `DbBuilder` and `CompactorBuilder` gain `with_filter_policies`. `ScanOptions` gains a `filter_context: Option<Arc<dyn Any + Send + Sync>>` field. See Configuration for migration examples.
- Rolling upgrades: Old readers that don’t understand the `filter_version` field will ignore it (FlatBuffers forward compatibility). They will continue to decode filters as bloom filters, which is correct as long as the bloom filter policy is still in use.
- Prefix extractor changes: Changing the prefix extractor changes the policy name (e.g., `"slatedb.BloomFilter:prefix=fixed3"` → `"slatedb.BloomFilter:prefix=delim:"`). With the array design, the old policy can remain in `filter_policies` alongside the new one, so existing SSTs’ filters remain usable while new SSTs are written with both. After compaction rewrites all SSTs, the old policy can be removed.
- Compactor policy: If the compactor runs in a separate process from the writer (e.g., distributed compaction), it must be configured with the same `filter_policies`. Otherwise, compacted SSTs will be written with different (or no) filter policies, and existing filters may be silently dropped during compaction.
Testing
- Unit tests: Filter policy correctness (no false negatives, expected false positive rates) for each built-in implementation.
- Integration tests: Prefix scan SST skipping, mismatched filter policy graceful degradation, backwards compatibility with old SSTs.
- Performance tests: Benchmark prefix scans with and without prefix bloom filters; the improvement from prefix bloom filters should be clearly measurable.
Rollout
Implementation will be in two phases:
- Pluggable filter abstraction: Trait definitions (`FilterPolicy`, `FilterBuilder`, `Filter`, `PrefixExtractor`), refactoring the existing bloom filter as the default `BloomFilterPolicy`, SST format changes (`filter_version`, composite filter block), and refactoring the write/read paths to use the pluggable filter policies with AND-logic evaluation. Breaking config change: `filter_bits_per_key` removed from `Settings`; `with_filter_policies` added to `DbBuilder`/`CompactorBuilder`.
- Prefix bloom filter: Add `prefix_extractor` and `whole_key_filtering` to `BloomFilterPolicy`. Wire the prefix through the read path so each SST’s filters can be probed with a `FilterTarget::Prefix` query before opening. Skip SSTs where any filter rejects the prefix.
Future Enhancements
Section titled “Future Enhancements”Recency-Based Iterator
Prefix bloom filters skip SSTs that don’t contain matching keys, but the merge-based scan still reads from every matching SST and merges them to produce global sort order. For workloads where the caller only needs the most recent entry for a prefix (e.g., latest version of a versioned key, most recent reading from a sensor), this is wasteful — the answer is almost always in the newest SST.
A recency-based iterator would return entries newest-first (most recent SST
first, then next most recent) without merging, similar to how get() walks
sources in recency order and returns the first match. Combined with prefix
bloom filters, the iterator would skip SSTs with no matching prefix, read
matching SSTs in recency order, and stop early once the caller has enough
results.
Block-Level Filter Granularity
This RFC’s filters operate at SST granularity; the reader either skips an entire SST or reads it. This is appropriate for bloom filters (which need many keys for good false-positive rates) but too coarse for metadata-style filters.
For example, a min/max version filter on an SST whose blocks span versions
[1,50], [51,100], and [101,150] can only report the SST-level range
[1,150]. A query for version ≤ 110 must read the entire SST even though only
the first block is relevant. With per-block metadata, the remaining blocks
would be skipped without loading them from storage.
The FilterPolicy, FilterBuilder, and Filter traits defined in this RFC
are sufficient to support block-level granularity. The core change is
configuration: each policy would declare whether it applies at SST level,
block level, or both. On the write path, block-level builders are finalized
and reset per block in finish_block() instead of once at SST completion.
On the read path, per-block filter data stored in BlockMeta is checked
before loading a data block, using the same might_match interface.
Key Selection via KeySelector
The current `BloomFilterPolicy` applies to all keys in an SST. Some workloads
need different filter strategies for different key subsets — e.g., a full-key
bloom for keys starting with user:: and a prefix bloom for keys starting
with post::. With the current design, this requires implementing a custom FilterPolicy that
hard-codes the selection logic in add_entry and might_match.
A KeySelector trait would make this composable without custom policies:
```rust
pub trait KeySelector: Send + Sync {
    fn name(&self) -> &str;
    fn includes(&self, key: &[u8]) -> bool;
}
```

`BloomFilterPolicy` would gain an optional `with_key_selector(Arc<dyn KeySelector>)`
method. On the write path, `KeySelector::includes` gates entry inclusion — if `false`, the entry is skipped entirely (no full-key hash, no prefix hash). On the read path for `Point` queries, excluded keys return `true` (inapplicable).
Prefix scans will still need the `PrefixExtractor::in_domain` check, since the selector applies to keys, not to scan prefixes.
Example configuration:
```rust
let db = Db::builder("path", object_store)
    .with_filter_policies(vec![
        // Full-key bloom, only for "user::" keys
        Arc::new(BloomFilterPolicy::new(10)
            .with_key_selector(Arc::new(StartsWithSelector::new("user::")))),
        // Prefix bloom, only for "post::" keys
        Arc::new(BloomFilterPolicy::new(10)
            .with_key_selector(Arc::new(StartsWithSelector::new("post::")))
            .with_prefix_extractor(Arc::new(FixedPrefixExtractor::new(6)))
            .with_whole_key_filtering(false)),
    ])
    .build()
    .await?;
```

`KeySelector` would be `BloomFilterPolicy`-specific, not part of the core `FilterPolicy`/`FilterBuilder`/`Filter` traits. Custom filter policies can implement their own selection logic directly in `add_entry`/`might_match`.
Alternatives
1. Separate filter structures per SST: Maintain two separate bloom filters per SST, one for full keys and one for prefixes, and route queries to the appropriate filter. This is conceptually clean but wastes space: two separate bit arrays have higher overhead than a single shared one. Rejected in favor of the single-filter approach.
2. Ship DIVA or SuRF as a built-in filter: The optimal filter design is workload-dependent. Rather than picking one, the pluggable trait lets users implement their own policy. Rejected as a built-in; users can integrate these as plugins.
3. Status quo (no prefix filtering): Prefix scans continue to read all overlapping SSTs. This is viable for small databases but becomes a bottleneck for large datasets with many sorted runs, as described in issues #1334 and #1302. Rejected because the performance impact is significant and the fix is well-understood.
Open Questions
Section titled “Open Questions”Should scan / scan_with_options also support prefix filtering?
Resolved: Option A — no inference. Prefix filtering is available only via `scan_prefix` / `scan_prefix_with_options`; `scan` / `scan_with_options` never use prefix filters. This keeps behavior explicit and avoids ambiguity about when filtering is active.
Should filter_policies be an array or a single policy?
Resolved: Array. Both reviewers approved the array approach. The array
enables migration (old policy stays alongside the new one), multi-filter
composition (e.g., bloom + min/max), and avoids a future breaking API change.
A “compound filter” that looks like a single filter to SlateDB but internally contains multiple sub-filters (e.g., `filter_2_start_idx`, `filter_1_bytes`, `filter_2_bytes`) was also discussed as a complementary approach for cases like
different filter strategies per key subset — to be explored further in code.
Should range scan filtering be added?
This RFC only adds prefix-based SST filtering. Range scan filtering (skipping
SSTs based on arbitrary range bounds) is not included but can be added with the
current design — a `FilterTarget::Range` variant and a range-aware filter policy
would slot into the existing might_match dispatch without structural changes.
The read path would need to pass the full scan range bounds to the filter
instead of just a prefix.
References
- Issue #1334: Improve prefix scan performance with filtering
- Issue #1302: High scan operation latency
- LevelDB FilterPolicy
- RocksDB Bloom Filter wiki
- RocksDB Prefix Seek
- DIVA: Dynamic Range Filter for Var-Length Keys (VLDB 2025)
- SuRF: Succinct Range Filters (SIGMOD 2018)