Compaction Filters

Status: Accepted

Authors:

This RFC introduces a public API for user-provided compaction filters in SlateDB. Users implement a CompactionFilterSupplier that creates a CompactionFilter instance for each compaction job. Each filter can inspect entries and decide to keep, drop, convert to tombstone, or modify values. The design makes existing internal types (RowEntry, ValueDeletable) public and uses CompactionJobContext to provide context to the filter.

Compaction filters are gated behind the compaction_filters feature flag. Enabling this feature may affect snapshot consistency. See Limitations for details.

SlateDB has no public API for custom compaction filters. Users need this capability for:

  1. Custom TTL Logic: Application-specific expiration beyond built-in TTL support.
  2. MVCC Garbage Collection: Custom policies for user-defined versioning.
  3. Schema Migrations: Data format conversions during compaction.

This design could also unify internal TTL handling (RetentionIterator) with user-provided filters in the future.

Goals:

  • Provide a user-friendly API following established SlateDB patterns (CompactionSchedulerSupplier).
  • Zero overhead when no filter is configured. Existing users are unaffected.
  • A design that can potentially be extended to enable internal TTL filtering.

Non-goals:

  • Replacing the internal RetentionIterator immediately.
  • Modifying an entry's key (filters can only modify values).
  • Emitting new entries during compaction (filters can only keep, drop, tombstone, or modify existing entries).
  • Guaranteeing snapshot consistency when compaction filters are enabled (see Limitations).

The following existing internal types are made public to support compaction filters:

/// Entry in the LSM tree (made public).
pub struct RowEntry {
    pub key: Bytes,
    pub value: ValueDeletable,
    pub seq: u64,
    pub create_ts: Option<i64>,
    pub expire_ts: Option<i64>,
}

/// Value that can be a value, merge operand, or tombstone (made public).
pub enum ValueDeletable {
    Value(Bytes),
    Merge(Bytes),
    Tombstone,
}
/// Context information about a compaction job.
///
/// This struct provides read-only information about the current compaction job
/// to help filters make informed decisions.
pub struct CompactionJobContext {
    /// The destination sorted run ID for this compaction.
    pub destination: u32,
    /// Whether the destination sorted run is the last (oldest) run after compaction.
    /// When true, tombstones can be safely dropped since there are no older versions below.
    pub is_dest_last_run: bool,
    /// The logical clock tick representing the logical time the compaction occurs.
    /// This is used to make decisions about retention of expiring records.
    pub compaction_clock_tick: i64,
    /// Optional minimum sequence number to retain.
    ///
    /// Entries with sequence numbers at or above this threshold are protected by
    /// active snapshots. Dropping or modifying such entries may cause snapshot
    /// reads to return inconsistent results.
    pub retention_min_seq: Option<u64>,
}
/// Decision returned by a compaction filter for each entry.
pub enum CompactionFilterDecision {
    /// Keep the entry unchanged.
    Keep,
    /// Drop the entry entirely. The entry will not appear in the compaction output.
    ///
    /// WARNING: Dropping an entry removes it completely without leaving a tombstone.
    /// This means older versions of the same key in lower levels of the LSM tree
    /// may become visible again ("resurrection"). Only use Drop when you are certain
    /// there are no older versions that could resurface, or when that behavior is
    /// acceptable for your use case.
    Drop,
    /// Modify the entry's value.
    ///
    /// Pass `ValueDeletable::Tombstone` to convert the entry to a tombstone.
    /// When converting to a tombstone, the entry's `expire_ts` is automatically cleared.
    ///
    /// Pass `ValueDeletable::Value(bytes)` to change the value. Key and other
    /// metadata remain unchanged.
    ///
    /// Note: If `Value` is applied to a tombstone, the entry becomes a regular value
    /// with the tombstone's sequence number, effectively resurrecting the key.
    Modify(ValueDeletable),
}
/// Filter that processes entries during compaction.
///
/// Each filter instance is created for a single compaction job and executes
/// single-threaded on the compactor thread. The filter must be `Send + Sync`
/// to satisfy iterator trait requirements.
#[async_trait]
pub trait CompactionFilter: Send + Sync {
    /// Filter a single entry.
    ///
    /// Return `Ok(decision)` to keep, drop, or modify the entry.
    /// Return `Err(FilterError)` to abort the compaction job.
    ///
    /// This method is async to allow I/O operations (e.g., checking external
    /// services, loading configuration). However, for best performance, prefer
    /// doing I/O in `create_compaction_filter()` or `on_compaction_end()` when
    /// possible, since this method is called for every entry.
    async fn filter(
        &mut self,
        entry: &RowEntry,
    ) -> Result<CompactionFilterDecision, FilterError>;

    /// Called after processing all entries.
    ///
    /// Use this hook to flush state, log statistics, or clean up resources.
    /// This method is infallible since compaction output has already been written.
    async fn on_compaction_end(&mut self);
}
/// Supplier that creates a CompactionFilter instance per compaction job.
///
/// The supplier is shared across all compactions and must be thread-safe (`Send + Sync`).
/// It creates a new filter instance for each compaction job, providing isolated state per job.
#[async_trait]
pub trait CompactionFilterSupplier: Send + Sync {
    /// Create a filter for a compaction job. Return Err to abort compaction.
    ///
    /// This is async to allow I/O during initialization (loading config,
    /// connecting to external services, etc.) before the filter processes entries.
    ///
    /// # Arguments
    ///
    /// * `context` - Context about the compaction job (destination, clock tick, etc.)
    async fn create_compaction_filter(
        &self,
        context: &CompactionJobContext,
    ) -> Result<Box<dyn CompactionFilter>, CreationError>;
}
/// Error returned by `create_compaction_filter`. Aborts the compaction job.
#[derive(Debug, Error)]
#[error("filter creation failed: {0}")]
pub struct CreationError(#[source] pub Box<dyn std::error::Error + Send + Sync>);

/// Error returned by `filter`. Aborts the compaction job.
#[derive(Debug, Error)]
#[error("filter error: {0}")]
pub struct FilterError(#[source] pub Box<dyn std::error::Error + Send + Sync>);

/// Container for all compaction filter errors.
///
/// Used internally by the compactor to handle errors from both
/// filter creation and per-entry filtering.
#[derive(Debug, Error)]
pub enum CompactionFilterError {
    #[error(transparent)]
    Creation(#[from] CreationError),
    #[error(transparent)]
    Filter(#[from] FilterError),
}

These error types wrap the underlying cause, preserving error chains for debugging. The #[source] attribute enables std::error::Error::source() to return the wrapped error. The CompactionFilterError enum provides a unified type for the compactor to handle all filter-related errors.
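
As a brief illustration (not part of the RFC text), an application error can be boxed directly into these types, since each wraps a public boxed source; std::io::Error stands in here for any application error:

// Hypothetical sketch: boxing an arbitrary application error into the error types above.
let cause = std::io::Error::new(std::io::ErrorKind::Other, "config load failed");
let _creation_err = CreationError(Box::new(cause));

let cause = std::io::Error::new(std::io::ErrorKind::Other, "entry failed validation");
let filter_err = FilterError(Box::new(cause));

// The compactor folds either variant into CompactionFilterError via the derived From impls.
let _unified: CompactionFilterError = filter_err.into();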

The CompactionFilterSupplier is configured on the component that runs compaction:

// In DbBuilder (db/builder.rs) - for embedded compactor
pub fn with_compaction_filter_supplier(
    mut self,
    supplier: Arc<dyn CompactionFilterSupplier>,
) -> Self {
    self.compaction_filter_supplier = Some(supplier);
    self
}

// In CompactorBuilder (db/builder.rs) - for standalone compactor
pub fn with_compaction_filter_supplier(
    mut self,
    supplier: Arc<dyn CompactionFilterSupplier>,
) -> Self {
    self.compaction_filter_supplier = Some(supplier);
    self
}

When running a standalone compactor (separate from the DB writer), users must ensure that the CompactorBuilder is configured with the same CompactionFilterSupplier as the DbBuilder.

For a comprehensive overview of SlateDB’s compaction design, see RFC-0002: Compaction.

SlateDB uses an LSM-tree architecture with two main storage layers:

  1. L0 (Level 0): Recently flushed SSTs from the memtable. These may have overlapping key ranges.
  2. Sorted Runs: Compacted SSTs organized into sorted runs, each containing non-overlapping key ranges. Sorted runs are identified by ID, where lower IDs contain older data.
┌────────────────────────────────────────────┐
│ L0 (newest data)                           │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐            │
│ │SST 4│ │SST 3│ │SST 2│ │SST 1│            │
│ └─────┘ └─────┘ └─────┘ └─────┘            │
├────────────────────────────────────────────┤
│ Sorted Runs (compacted, older data below)  │
│                                            │
│ SR 10 (newest) ──► SR 5 ──► SR 0 (oldest)  │
└────────────────────────────────────────────┘

Compaction merges entries from multiple sources (L0 SSTs and/or sorted runs) into a single destination sorted run. The compaction executor processes entries one at a time through an iterator pipeline.

The core compaction loop in execute_compaction_job() is straightforward:

while let Some(kv) = all_iter.next_entry().await? {
    current_writer.add(kv).await?;
    // ... handle SST size limits, progress reporting, etc.
}

Each call to next_entry() retrieves the next entry from the iterator pipeline, which handles merging, deduplication, and filtering.

The compaction executor builds an iterator pipeline in load_iterators(). Each layer wraps the previous one, processing entries as they flow through:

MergeIterator (L0 + SortedRuns)
  -> MergeOperatorIterator (resolve merge operands)
  -> RetentionIterator (TTL, snapshot retention, tombstone cleanup)
  -> CompactionFilterIterator (user filters)

What each iterator does:

  1. MergeIterator: Combines entries from all input sources (L0 SSTs and sorted runs) into a single sorted stream. When the same key appears in multiple sources, entries are ordered by sequence number (newest first). This is where the actual “merge” in merge-sort happens.

  2. MergeOperatorIterator: Resolves merge operands. If the user of the database uses a MergeOperator, this iterator combines consecutive merge operands into a single resolved value.

  3. RetentionIterator: Applies built-in retention policies:

    • Drops expired entries (TTL).
    • Removes old versions not needed by snapshots.
    • Cleans up tombstones at the bottommost level.
  4. CompactionFilterIterator (this RFC): Applies user-provided filters. This is where the CompactionFilter::filter() method is called for each entry; a simplified sketch of this layer follows below.

This ordering ensures:

  1. Merge operands are resolved before filtering.
  2. Expired entries and old versions are already removed.
  3. User filters only see “live” entries that would otherwise be written.
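
To make the integration point concrete, the following is a simplified, hypothetical sketch of how the filter-applying layer could translate decisions into output entries; the actual CompactionFilterIterator internals may differ:

use slatedb::{CompactionFilter, CompactionFilterDecision, FilterError, RowEntry, ValueDeletable};

// Hypothetical, simplified sketch of the filter-applying layer.
async fn apply_filter(
    filter: &mut dyn CompactionFilter,
    entry: RowEntry,
) -> Result<Option<RowEntry>, FilterError> {
    match filter.filter(&entry).await? {
        // Keep: pass the entry through unchanged.
        CompactionFilterDecision::Keep => Ok(Some(entry)),
        // Drop: omit the entry from the compaction output entirely.
        CompactionFilterDecision::Drop => Ok(None),
        // Modify: swap the value; converting to a tombstone clears expire_ts.
        CompactionFilterDecision::Modify(value) => {
            let expire_ts = match value {
                ValueDeletable::Tombstone => None,
                _ => entry.expire_ts,
            };
            Ok(Some(RowEntry { value, expire_ts, ..entry }))
        }
    }
}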

Compaction filters are gated behind the compaction_filters feature flag:

[dependencies]
slatedb = { version = "...", features = ["compaction_filters"] }

Enabling this feature may affect snapshot consistency.

Why?

Protecting snapshot data from arbitrary user filters adds significant complexity. Not all use cases require snapshot consistency guarantees, so we start simple with a feature flag to ensure users understand the trade-offs. This design can evolve if new use cases emerge that require snapshot protection.

RocksDB faced the same challenge and removed snapshot protection from compaction filters in v6.0, noting “the feature has a bug which can’t be easily fixed.”

When using compaction filters with snapshots, be aware that:

  • Filters may modify or drop entries that snapshots expect to see.
  • Snapshot reads may return unexpected results if the filter altered the data.
  • Users who need consistent snapshots should carefully consider their filter logic; one possible mitigation is sketched below.
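
One mitigation available to filter authors, shown here as a sketch rather than a general consistency guarantee, is to capture retention_min_seq from the CompactionJobContext and leave any entry that may still be visible to a snapshot untouched:

use async_trait::async_trait;
use slatedb::{CompactionFilter, CompactionFilterDecision, FilterError, RowEntry};

// Hypothetical sketch: leave snapshot-protected entries untouched.
struct SnapshotAwareFilter {
    // Captured from CompactionJobContext::retention_min_seq in create_compaction_filter().
    retention_min_seq: Option<u64>,
}

#[async_trait]
impl CompactionFilter for SnapshotAwareFilter {
    async fn filter(
        &mut self,
        entry: &RowEntry,
    ) -> Result<CompactionFilterDecision, FilterError> {
        // Entries at or above retention_min_seq may be visible to active snapshots;
        // keeping them unchanged avoids inconsistent snapshot reads.
        if matches!(self.retention_min_seq, Some(min_seq) if entry.seq >= min_seq) {
            return Ok(CompactionFilterDecision::Keep);
        }
        // Older entries can be filtered according to application policy
        // (placeholder: keep everything).
        Ok(CompactionFilterDecision::Keep)
    }

    async fn on_compaction_end(&mut self) {}
}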

The CompactionFilter trait is designed to be general enough that internal TTL filtering could potentially be refactored to use the same abstraction. However, the current RetentionIterator buffers all versions of a key before applying retention policies (e.g., keeping boundary values for snapshot consistency). Unifying these would require either refactoring RetentionIterator to work entry-by-entry, or extending the filter API to receive all versions of a key at once.

Method                        Error Type      Behavior
create_compaction_filter()    CreationError   Aborts compaction job
filter()                      FilterError     Aborts compaction job
on_compaction_end()           Infallible      Cleanup cannot fail compaction

Creating a fresh filter instance per compaction provides:

  1. Isolation: No shared mutable state between compaction jobs.
  2. Single-threaded execution: Filter runs on the same thread as the compactor, no synchronization needed.
  3. State tracking: Filters can safely accumulate statistics or state across all entries in a compaction.
  4. Simplified reasoning: No concurrent access concerns within a filter.

The filter() method is called for every entry during compaction. While the method is async to allow I/O when needed, frequent I/O per entry can significantly impact compaction throughput. For best performance:

  • Prefer batching I/O: Load configuration or external state in create_compaction_filter() rather than per-entry in filter() (see the sketch after this list).
  • Cache decisions: If checking an external service, cache results to avoid repeated calls for similar entries.
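
A minimal sketch of the batching pattern, assuming a hypothetical application-side load_denylisted_prefixes() helper that performs the expensive I/O once per compaction job:

use async_trait::async_trait;
use bytes::Bytes;
use slatedb::{
    CompactionFilter, CompactionFilterDecision, CompactionFilterSupplier,
    CompactionJobContext, CreationError, FilterError, RowEntry, ValueDeletable,
};

// Hypothetical application-side helper; stands in for real I/O (HTTP call, config load, ...).
async fn load_denylisted_prefixes() -> Result<Vec<Bytes>, std::io::Error> {
    Ok(vec![Bytes::from_static(b"temp:")])
}

struct DenylistFilterSupplier;

#[async_trait]
impl CompactionFilterSupplier for DenylistFilterSupplier {
    async fn create_compaction_filter(
        &self,
        _context: &CompactionJobContext,
    ) -> Result<Box<dyn CompactionFilter>, CreationError> {
        // Expensive I/O happens once per compaction job, not once per entry.
        let prefixes = load_denylisted_prefixes()
            .await
            .map_err(|e| CreationError(Box::new(e)))?;
        Ok(Box::new(DenylistFilter { prefixes }))
    }
}

struct DenylistFilter {
    prefixes: Vec<Bytes>,
}

#[async_trait]
impl CompactionFilter for DenylistFilter {
    async fn filter(
        &mut self,
        entry: &RowEntry,
    ) -> Result<CompactionFilterDecision, FilterError> {
        // The per-entry hot path stays purely in-memory.
        if self.prefixes.iter().any(|p| entry.key.starts_with(p)) {
            return Ok(CompactionFilterDecision::Modify(ValueDeletable::Tombstone));
        }
        Ok(CompactionFilterDecision::Keep)
    }

    async fn on_compaction_end(&mut self) {}
}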

For CPU-intensive filters:

If your filter performs expensive synchronous computation (e.g., complex parsing, cryptographic operations), consider using a dedicated compaction runtime to prevent blocking your application’s main runtime:

let compaction_runtime = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(2)
    .thread_name("compaction")
    .build()?;

let db = Db::builder("mydb", object_store)
    .with_compaction_runtime(compaction_runtime.handle().clone())
    .with_compaction_filter_supplier(Arc::new(MyCpuIntensiveFilter))
    .build()
    .await?;

The following end-to-end example registers a filter that converts all entries with a given key prefix to tombstones:

use slatedb::{
    CompactionFilter, CompactionFilterSupplier, CompactionJobContext,
    CompactionFilterDecision, CreationError, FilterError, RowEntry, ValueDeletable,
};
use bytes::Bytes;
use std::sync::Arc;
use async_trait::async_trait;

/// A filter that converts all entries with a specific key prefix to tombstones.
struct PrefixDroppingFilter {
    prefix: Bytes,
    dropped_count: u64,
}

#[async_trait]
impl CompactionFilter for PrefixDroppingFilter {
    async fn filter(
        &mut self,
        entry: &RowEntry,
    ) -> Result<CompactionFilterDecision, FilterError> {
        if entry.key.starts_with(&self.prefix) {
            self.dropped_count += 1;
            // Use Tombstone to shadow older versions in lower levels
            return Ok(CompactionFilterDecision::Modify(ValueDeletable::Tombstone));
        }
        Ok(CompactionFilterDecision::Keep)
    }

    async fn on_compaction_end(&mut self) {
        tracing::info!(
            "Compaction dropped {} entries with prefix {:?}",
            self.dropped_count,
            self.prefix
        );
    }
}

struct PrefixDroppingFilterSupplier {
    prefix: Bytes,
}

#[async_trait]
impl CompactionFilterSupplier for PrefixDroppingFilterSupplier {
    async fn create_compaction_filter(
        &self,
        _context: &CompactionJobContext,
    ) -> Result<Box<dyn CompactionFilter>, CreationError> {
        Ok(Box::new(PrefixDroppingFilter {
            prefix: self.prefix.clone(),
            dropped_count: 0,
        }))
    }
}

// Usage
let db = Db::builder("mydb", object_store)
    .with_compaction_filter_supplier(Arc::new(PrefixDroppingFilterSupplier {
        prefix: Bytes::from_static(b"temp:"),
    }))
    .build()
    .await?;
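
As a second, hypothetical illustration of using CompactionJobContext, a filter could implement an application-defined TTL based on the compaction clock tick; the TTL_MILLIS constant and the assumption that create_ts and the clock tick share the same millisecond unit are assumptions of this sketch:

use async_trait::async_trait;
use slatedb::{
    CompactionFilter, CompactionFilterDecision, CompactionFilterSupplier,
    CompactionJobContext, CreationError, FilterError, RowEntry, ValueDeletable,
};

// Assumed TTL of one day, in the same units as create_ts and the clock tick.
const TTL_MILLIS: i64 = 24 * 60 * 60 * 1000;

struct CustomTtlFilter {
    // Logical time of the compaction, captured from CompactionJobContext.
    compaction_clock_tick: i64,
}

#[async_trait]
impl CompactionFilter for CustomTtlFilter {
    async fn filter(
        &mut self,
        entry: &RowEntry,
    ) -> Result<CompactionFilterDecision, FilterError> {
        if let Some(create_ts) = entry.create_ts {
            if self.compaction_clock_tick - create_ts > TTL_MILLIS {
                // Tombstone rather than Drop so older versions below cannot resurface.
                return Ok(CompactionFilterDecision::Modify(ValueDeletable::Tombstone));
            }
        }
        Ok(CompactionFilterDecision::Keep)
    }

    async fn on_compaction_end(&mut self) {}
}

struct CustomTtlFilterSupplier;

#[async_trait]
impl CompactionFilterSupplier for CustomTtlFilterSupplier {
    async fn create_compaction_filter(
        &self,
        context: &CompactionJobContext,
    ) -> Result<Box<dyn CompactionFilter>, CreationError> {
        Ok(Box::new(CustomTtlFilter {
            compaction_clock_tick: context.compaction_clock_tick,
        }))
    }
}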

SlateDB features and components that this RFC interacts with. Check all that apply.

  • Basic KV API (get/put/delete)
  • Range queries, iterators, seek semantics
  • Range deletions
  • Error model, API errors

Consistency, Isolation, and Multi-Versioning

  • Transactions - consistency may be affected when compaction filters are enabled (see Limitations).
  • Snapshots - consistency may be affected when compaction filters are enabled (see Limitations).
  • Sequence numbers
  • Logical clocks
  • Time to live (TTL) - built-in TTL runs before user filters; expired entries are already removed
  • Compaction filters - this RFC
  • Merge operator - filters run after merge resolution
  • Change Data Capture (CDC)
  • Manifest format
  • Checkpoints
  • Clones
  • Garbage collection - Drop/Tombstone decisions remove entries
  • Database splitting and merging
  • Multi-writer
  • Compaction state persistence
  • Compaction filters - this RFC
  • Compaction strategies
  • Distributed compaction
  • Compactions format
  • Write-ahead log (WAL)
  • Block cache
  • Object store cache
  • Indexing (bloom filters, metadata)
  • SST format or block format
  • CLI tools
  • Language bindings (Go/Python/etc)
  • Observability (metrics/logging/tracing)
  • Latency: filter() is async but called per-entry - minimize I/O in hot path
  • Throughput: For best performance, batch I/O in create_compaction_filter() or cache decisions
  • Object-store requests: No direct impact; filters operate on in-memory data
  • Space amplification: Drop/Modify(Tombstone) decisions reduce space; Modify(Value) may increase or decrease.
  • Zero overhead when disabled: Users who do not configure a filter are not impacted.

Well-implemented filters have minimal overhead on compaction throughput.

  • Metrics: No new counters.
  • Logging: Filter errors logged at WARN level.
  • Configuration changes: New compaction_filter_supplier field in Settings.
  • Existing data: no change.
  • Public APIs: New optional configuration in Settings and DbBuilder. RowEntry and ValueDeletable become public.
  • Rolling upgrades: not needed.
  • Unit tests: Each decision type (Keep/Drop/Tombstone/Modify), error handling, lifecycle hooks.
  • Integration tests: End-to-end compaction with custom filters, verify data correctness.
  • Fault-injection tests: Filter errors, initialization failures.
  • Deterministic simulation tests: Include filter behavior in DST.
  • Performance tests: Benchmark compaction throughput with/without filters.
  • Core traits and iterator integration.
  • Make RowEntry and ValueDeletable public.
  • Basic tests and documentation.

The compaction_filters feature flag gates the CompactionFilterSupplier trait. See Limitations for why this is behind a feature flag.

  • Add examples to API documentation.
  • Update compaction documentation to describe filter integration point.

1. Unify built-in retention (RetentionIterator) with compaction filters

Deferred: Built-in retention handles subtle edge cases (snapshot barriers, merge operand expiration). It could be unified with user filters in the future.

2. Batched filter API (all versions of a key at once)

Instead of filter(entry) -> Decision, provide filter(Vec<RowEntry>) -> Vec<Decision> where all versions of a key are passed together. This would:

  • Enable look-ahead logic (matching RetentionIterator’s current implementation)
  • Allow filters to make decisions based on the full version history

Deferred: Adds API complexity and potential allocations. We’re starting simple with entry-at-a-time filtering, which covers most use cases. The API can evolve if batched filtering becomes necessary.
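
For illustration only, the deferred batched API might look roughly like the following sketch; it is not part of this RFC:

use async_trait::async_trait;
use slatedb::{CompactionFilterDecision, FilterError, RowEntry};

// Hypothetical sketch of a batched filter API: all versions of one key are
// passed together (newest first), enabling look-ahead decisions.
#[async_trait]
pub trait BatchedCompactionFilter: Send + Sync {
    async fn filter_versions(
        &mut self,
        versions: &[RowEntry],
    ) -> Result<Vec<CompactionFilterDecision>, FilterError>;
}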

3. Single filter instance (no factory/supplier)

Supplier provides isolation between jobs and enables single-threaded execution without synchronization.

4. Define custom types instead of using RowEntry

Using existing types (RowEntry, CompactionJobContext) reduces API surface and avoids per-entry allocations for wrapper types.

5. Built-in support for multiple filters (filter chains)

Users who need multiple filters can implement a single CompactionFilter that internally chains multiple filters, as sketched below. This keeps SlateDB simpler while still enabling advanced use cases.
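
A minimal sketch of one way to chain filters inside a single CompactionFilter; the "first non-Keep decision wins" policy is an assumption of this sketch, not part of the proposed API:

use async_trait::async_trait;
use slatedb::{CompactionFilter, CompactionFilterDecision, FilterError, RowEntry};

// Hypothetical chaining wrapper; chaining semantics are entirely up to the application.
struct ChainedFilter {
    filters: Vec<Box<dyn CompactionFilter>>,
}

#[async_trait]
impl CompactionFilter for ChainedFilter {
    async fn filter(
        &mut self,
        entry: &RowEntry,
    ) -> Result<CompactionFilterDecision, FilterError> {
        for f in self.filters.iter_mut() {
            match f.filter(entry).await? {
                // Keep: consult the next filter in the chain.
                CompactionFilterDecision::Keep => continue,
                // First non-Keep decision wins.
                decision => return Ok(decision),
            }
        }
        Ok(CompactionFilterDecision::Keep)
    }

    async fn on_compaction_end(&mut self) {
        for f in self.filters.iter_mut() {
            f.on_compaction_end().await;
        }
    }
}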

Log major changes to this RFC over time.