Skip to content

Custom Comparators

<!— Replace “RFC Title” with your RFC’s short, descriptive title. —>

Status: Rejected

Authors:

> [!CAUTION] > This RFC was rejected. See the Rejection Analysis for more details.

<!— TOC —><a name=“summary-motivation”></a>

This RFC proposes a new API for customizing the ordering of keys in SlateDB. The ordering of keys in Slate is a critical component of the performance characteristics of the database and its ability to efficiently serve range queries.

The current implementation of SlateDB uses a lexicographic ordering of keys. This is a good default for many use cases, but it may not always be the best choice. Some examples of use cases where a different ordering is desirable include:

  • Correct semantics for non-string keys (e.g. sorting floating point numbers)
  • Sorting composite keys (e.g. a composite key of [user_id, timestamp] may sort ascending by user_id and descending by timestamp)
  • Domain-specific ordering (e.g. Z-ordering for geographic data)

<!— TOC —><a name=“goals”></a>

  • Provide a new API for customizing the ordering of keys in SlateDB
  • Maintain backwards compatibility by defaulting to the existing lexicographic ordering
  • Best-effort enforcing of the ordering invariant (e.g. a comparator doesn’t change after first deployment)

<!— TOC —><a name=“non-goals”></a>

  • Providing default implementations for non-lexicographic use cases
  • Supporting changes to the ordering invariant after first deployment

<!— TOC —><a name=“design”></a>

<!— A detailed description of the proposed change. Include diagrams, examples, schemas, and pseudo-code as appropriate. —>

<!— TOC —><a name=“api”></a>

We will introduce a new interface for customizing the ordering of keys in SlateDB, the KeyComparator trait:

use std::cmp::Ordering;
use std::sync::Arc;
/// Custom key comparison trait.
/// WARNING: Changing comparators on existing DB causes corruption.
pub trait KeyComparator: Send + Sync + std::fmt::Debug {
/// Compare two keys.
fn compare(&amp;self, a: &amp;[u8], b: &amp;[u8]) -&gt; Ordering;
/// Unique identifier stored in manifest for validation.
fn name(&amp;self) -&gt; &str;
}
pub type KeyComparatorType = Arc&lt;dyn KeyComparator&gt;;

This will be passed in when constructing a new Db instance with the DbBuilder:

let db = DbBuilder::new(path)
// ...
.with_key_comparator(key_comparator)
.build();

<!— TOC —><a name=“manifest”></a>

We will introduce a new field to the manifest format to store the name of the comparator:

table ManifestV1 {
// ...
key_comparator_name: string;
}

This field will be used to validate that the comparator is not changed after first deployment. if it is set in the manifest, a comparator with the same name must be used to create the database in runtime (no comparator will also cause a panic). If it is not set and the manifest already exists, the database will panic if a custom comparator is supplied at runtime.

<!— TOC —><a name=“implementation”></a>

We’ll need to modify each of these places to use the new KeyComparator passed in from the DbBuilder. Note that because it’s a trait, we’ll need to use dynamic dispatch to call the comparator.

> [!NOTE] > Since comparisons happen on a hot path, we may want to avoid using the dyn KeyComparator > if not necessary to avoid the overhead of dynamic dispatch. Instead, we can choose to > use an Option&lt;KeyComparatorType&gt; to store the comparator and inline the default > lexicographic comparator if not set. > > The alternative would be to always use the dyn KeyComparator and just implement a default > implementation of it. This would be simpler, but it would add a small overhead to every > comparison.

The following table was generated via an agent, but it seems to be accurate as far as I can tell:

LocationUsage
slatedb/src/bytes_range.rs:8<br>slatedb/src/comparable_range.rs:8BytesRange and its ComparableRange backing types rely on the default Ord implementation for Bytes when comparing start/end bounds (via cmp_bound). Their logic powers range.contains, intersect, and conversions that every range-aware component (memtables, SST views, sorted runs, iterator seek validation) uses.
slatedb/src/mem_table.rs:36,189<br>slatedb/src/batch.rs:153,409SequencedKey::cmp keeps the memtable SkipMap and write-batch BTreeMap ordered by ascending user key and descending sequence number. MemTableIterator::seek and WriteBatchIterator::seek compare RowEntry.key / key.user_key to next_key, and WriteBatch::remove_ops_by_key issues start..=end range deletes that depend on this ordering.
slatedb/src/merge_iterator.rs:21-119MergeIterator keeps a binary heap ordered by RowEntry.key.cmp (and reverse seq on ties) and short-circuits child seeks when next_kv.key &gt;= next_key. This governs how memtable/WAL/SST iterators are merged for reads and compactions.
slatedb/src/block_iterator.rs:73-135<br>slatedb/src/sst_iter.rs:93-552<br>slatedb/src/partitioned_keyspace.rs:13-80Block iterators skip entries with kv.key &lt; next_key. Above the block layer, SstView::contains/key_exceeds, InternalSstIterator::blocks_covering_view, and SstIterator::seek compare requested keys to SST range bounds and block first-keys (via the partitioned keyspace helpers) to jump to the first block whose contents are &gt;= next_key.
slatedb/src/db_state.rs:137-303<br>slatedb/src/sorted_run_iterator.rs:176-188SsTableHandle::intersects_range, SortedRun::find_sst_with_range_covering_key, and table_idx_covering_range compare SST start keys and requested ranges to decide which files overlap a read/compaction. SortedRunIterator::seek then drops whole SSTs whose first_key &lt; next_key before seeking inside the current SST iterator.
slatedb/src/db_iter.rs:56-69<br>slatedb/src/db_iter.rs:341-357DbIteratorRangeTracker maintains the min/max key observed via &lt;/&gt; comparisons for SSI conflict detection, and DbIterator::seek enforces that callers only seek forward and stay inside the configured BytesRange by comparing next_key to last_key and the range bounds.
slatedb/src/reader.rs:512-528The SST-building helper sorts buffered RowEntrys with a.key.cmp(&amp;b.key) before emitting a table, modeling the requirement that compaction/flush writers provide keys in comparator order.

For use of the comparator in the memtable, we will need to modify the SequencedKey type to include a reference to the comparator and override the Ord implementation to use it. All other locations can use the comparator reference directly.

<!— TOC —><a name=“impact-analysis”></a>

SlateDB features and components that this RFC interacts with. Check all that apply.

<!— TOC —><a name=“core-api-query-semantics”></a>

  • Basic KV API (get/put/delete)
  • Range queries, iterators, seek semantics
  • Range deletions
  • Error model, API errors

Range query semantics will be affected by this change since the comparator is used to determine the order of keys in the range.

<!— TOC —><a name=“consistency-isolation-and-multi-versioning”></a>

Consistency, Isolation, and Multi-Versioning

Section titled “Consistency, Isolation, and Multi-Versioning”
  • Transactions
  • Snapshots
  • Sequence numbers

DbIteratorRangeTracker maintains the min/max key observed via </> comparisons for SSI conflict detection. Since the comparator is used to determine the order of keys, the min/max key observed will be in the order of the comparator (this should be consistent with the query semantics so the change is mostly transparent to the user).

<!— TOC —><a name=“time-retention-and-derived-state”></a>

  • Logical clocks
  • Time to live (TTL)
  • Compaction filters
  • Merge operator
  • Change Data Capture (CDC)

N/A - the change should not affect time, retention, and derived state.

<!— TOC —><a name=“metadata-coordination-and-lifecycles”></a>

  • Manifest format
  • Checkpoints
  • Clones
  • Garbage collection
  • Database splitting and merging
  • Multi-writer

We will add a new field to the manifest format to store the name of the comparator to serve as a basic validation mechanism that ensures the comparator is not changed after first deployment (this is rudimentary and does not prevent changes if the comparator uses the same name).

<!— TOC —><a name=“compaction”></a>

  • Compaction state persistence
  • Compaction filters
  • Compaction strategies
  • Distributed compaction
  • Compactions format

Compactions will need to use the comparator to sort the keys in the SSTs.

<!— TOC —><a name=“storage-engine-internals”></a>

  • Write-ahead log (WAL)
  • Block cache
  • Object store cache
  • Indexing (bloom filters, metadata)
  • SST format or block format

The indexes and SSTs will be sorted by the comparator

<!— TOC —><a name=“ecosystem-operations”></a>

  • CLI tools
  • Language bindings (Go/Python/etc)
  • Observability (metrics/logging/tracing)

<!— TOC —><a name=“operations”></a>

<!— TOC —><a name=“performance-cost”></a>

<!— Describe performance and cost implications of this change. —>

There will potentially be a performance impact in situations that perform many comparisons, such as compactions due to the cost of dynamic dispatch. As part of implementation we will benchmark the performance impact and decide whether or not to use the Option&lt;KeyComparatorType&gt; approach outlined in the implementation section above.

<!— TOC —><a name=“observability”></a>

<!— Describe any operational changes required to support this change. —>

This will require a new configuration option to set the key comparator.

<!— TOC —><a name=“compatibility”></a>

Existing deployments will set an empty string for the key comparator name in the manifest the first time it is deployed with a version including this RFC. If the field does not exist but the manifest already does, a custom comparator will not be allowed.

<!— TOC —><a name=“testing”></a>

We should have comprehensive tests to ensure we don’t break the ordering invariant going forward. All main code paths should be covered, in particular queriyng and compactions.

<!— TOC —><a name=“rollout”></a>

<!— Describe the plan for rolling out this change to production. —>

  • Milestones / phases:
  • Feature flags / opt-in:
  • Docs updates:

<!— TOC —><a name=“alternatives”></a>

N/A - the only alternative is to not use a custom comparator.

<!— TOC —><a name=“open-questions”></a>

The biggest option question is whether to use the Option&lt;KeyComparatorType&gt; approach or the dyn KeyComparator approach (see implementation section above).

<!— TOC —><a name=“references”></a>

<!— Bullet list of related issues, PRs, RFCs, papers, docs, discord discussions, etc. —>

  • github.com/slatedb/slatedb/issues/663
  • github.com/crossbeam-rs/crossbeam/issues/1180

<!— TOC —><a name=“updates”></a>

<!— TOC —><a name=“rejection-analysis”></a>

After discussion among maintainers, this RFC was rejected for the following reasons:

  1. Complexity vs. benefit: Custom comparators add significant implementation complexity (dynamic dispatch on hot paths, validation logic, manifest changes) for marginal benefit (there are alternative ways to structure most keys without custom comparators).

  2. Bug potential: RocksDB and Pebble have experienced bugs and regressions related to custom comparators. The additional code paths and invariants are difficult to test exhaustively.

  3. Industry precedent: FoundationDB, DynamoDB, TiKV and CockroachDB intentionally do not expose custom comparators—bytewise comparison only, by design. Users encode keys to sort correctly using their well-documented tuple layer. This approach has proven successful at scale.

  4. Use cases are solvable with encoding: All motivating use cases can be addressed through proper key encoding:

    • Floating points: flip sign bit for positives, flip all bits for negatives
    • Composite keys: use 0x00 delimiters or fixed-length components
    • Descending order: XOR bytes with 0xFF or subtract from max value
    • Geospatial: Z-order and Hilbert curves encode to sortable integers
  5. Documentation over API complexity: Providing clear documentation on key encoding patterns (with links to libraries like storekey) is preferable to exposing a complex API that enables subtle data corruption if misused.

See the Data Modeling documentation for recommended key encoding patterns.

See the Github Discussion for more details.