Expose Row Information
Status: Approved
Authors:
<!— TOC —><a name=“summary”></a>
Summary
Section titled “Summary”This RFC proposes three related API improvements:
- Modify Write API Return Types: Change the return type of
db.put(),db.delete(),db.merge(), anddb.write()(including their_with_optionsvariants) fromResult<(), ...>toResult<WriteHandle, ...>to return a write handle containing the assigned sequence number and timestamps. This design allows for future extensions such asawait_durability(). - New Row Query Interface: Introduce
get_row()andget_row_with_options()for single-key row queries that returnRowEntry. For scans, addDbIterator::next_row()which returnsRowEntry(including metadata: sequence number, creation timestamp, expiration timestamp). The existingDbIterator::next()continues to returnKeyValuefor backward compatibility. - Support Query by Version: Add a
seqnumoption toSnapshotOptions, enabling users to create snapshots at specific sequence numbers and read historical versions of keys.
<!— TOC —><a name=“motivation”></a>
Motivation
Section titled “Motivation”In practical applications, users often need to retrieve metadata information for keys. Currently, SlateDB has the following pain points:
1. Unable to Query Row Information and Metadata
Users need to obtain metadata such as TTL, sequence number, and creation time, but currently no API provides this information—even the get() method only returns the value itself.
2. Lack of Feedback for Write Operations
Currently, db.put() returns Result<(), Error>. Callers cannot know the sequence number assigned to the write operation, which is crucial for debugging, auditing, and MVCC scenarios.
3. Lack of Versioned Query Capability
Users cannot query a specific historical version of a key using a sequence number, which limits application scenarios related to Multi-Version Concurrency Control (MVCC).
<!— TOC —><a name=“goals”></a>
- Enable retrieval of complete row information including metadata (supporting both single-key queries and range scans).
- Return the sequence number assigned after a write operation (put, delete, merge, or batch write).
- Support reading historical versions of a key by specifying a sequence number.
<!— TOC —><a name=“non-goals”></a>
Non-Goals
Section titled “Non-Goals”- No change/modification for underlying behavior of any API.
<!— TOC —><a name=“use-cases”></a>
Use Cases
Section titled “Use Cases”<!— TOC —><a name=“durability-guarantees”></a>
Durability Guarantees
Section titled “Durability Guarantees”Applications requiring strong durability guarantees can use the sequence number to verify that writes have been persisted:
// Write with durability trackinglet handle = db.put(b"critical_data", value).await?;let write_seqnum = handle.seqnum();
// All writes with sequence numbers <= write_seqnum are guaranteed to be durable// Applications can use this for replication, backup, or recovery checkpointscheckpoint_manager.record_durable_seqnum(write_seqnum).await?;<!— TOC —><a name=“ttl-management”></a>
TTL Management
Section titled “TTL Management”Applications can query how long until a key expires using get_row():
// Query remaining TTL for a keyif let Some(entry) = db.get_row(b"session:abc123").await? { if let Some(expire_ts) = entry.expire_ts() { let ttl_seconds = (expire_ts - current_time_micros()) / 1_000_000; println!("Key expires in {} seconds", ttl_seconds); }}<!— TOC —><a name=“design”></a>
Design
Section titled “Design”<!— TOC —><a name=“1-row-query-interface”></a>
1. Row Query Interface
Section titled “1. Row Query Interface”> [!NOTE]
> This change introduces new row query APIs that reuse the existing RowEntry type for returning complete row information. Since decoding metadata requires decoding the entire row, there is no performance penalty for returning the full row data.
Public API:
// Get complete row information for a single keypub async fn get_row<K: AsRef<[u8]>>(&self, key: K) -> Result<Option<RowEntry>, crate::Error>;
// Get row with options (new)pub async fn get_row_with_options<K: AsRef<[u8]>>( &self, key: K, options: &ReadOptions,) -> Result<Option<RowEntry>, crate::Error>;
// Scan API (unchanged) - returns DbIteratorpub async fn scan<K, T>(&self, range: T) -> Result<DbIterator, crate::Error>where K: AsRef<[u8]> + Send, T: RangeBounds<K> + Send;
// DbIterator methodsimpl DbIterator { // Returns KeyValue for normal scan usage (default behavior) pub async fn next(&mut self) -> Result<Option<KeyValue>, crate::Error>;
// Returns complete row information including metadata pub async fn next_row(&mut self) -> Result<Option<RowEntry>, crate::Error>;}> [!NOTE]
> Implementation Note: Internally, the existing next_key_value() method is renamed to next_row() and modified to return RowEntry. The public next() method calls next_row() internally, extracts the KeyValue from the RowEntry, and returns it.
// RowEntry is already public; fields are pub(crate) with getter methodspub struct RowEntry { pub(crate) key: Bytes, pub(crate) value: ValueDeletable, pub(crate) seq: u64, pub(crate) create_ts: Option<i64>, pub(crate) expire_ts: Option<i64>,}
impl RowEntry { pub fn key(&self) -> &[u8] { &self.key } pub fn value(&self) -> &ValueDeletable { &self.value } pub fn seq(&self) -> u64 { self.seq } pub fn create_ts(&self) -> Option<i64> { self.create_ts } pub fn expire_ts(&self) -> Option<i64> { self.expire_ts }}
// ValueDeletable type definition (unchanged)pub enum ValueDeletable { Value(Bytes), // Regular value Merge(Bytes), // Merge operation value Tombstone, // Deleted key marker}> [!IMPORTANT]
> ValueDeletable Behavior in next_row()
>
> While ValueDeletable has three variants (Value, Merge, Tombstone), neither next() nor next_row() will return entries with Tombstone in practice:
> - Tombstones are filtered out by the underlying iterator
> - next() returns KeyValue (extracts the value from ValueDeletable, so users don’t need to match on it)
> - next_row() returns RowEntry, where users will only see ValueDeletable::Value or ValueDeletable::Merge
> - For merge operations: if a base value exists, returns Value with the computed result; otherwise returns Merge with the final merge value
Usage Example:
// Get complete row informationlet row = db.get_row(b"key").await?;if let Some(entry) = row { println!("Key: {:?}", entry.key()); println!("Seq: {}, Created: {:?}, Expires: {:?}", entry.seq(), entry.create_ts(), entry.expire_ts());
// Handle ValueDeletable variants (Tombstone will never appear here) match entry.value() { ValueDeletable::Value(v) => println!("Value: {:?}", v), ValueDeletable::Merge(m) => println!("Merge value: {:?}", m), ValueDeletable::Tombstone => unreachable!("get_row() never returns Tombstone"), }}
// Normal scan usage - returns KeyValue (no need to match on ValueDeletable)let mut iter = db.scan(b"a"..b"z").await?;while let Some((key, value)) = iter.next().await? { println!("Key: {:?}, Value: {:?}", key, value);}
// Scan with complete row information - returns RowEntrylet mut iter = db.scan(b"a"..b"z").await?;while let Some(row_entry) = iter.next_row().await? { match row_entry.value() { ValueDeletable::Value(v) => { println!("Key: {:?}, Value: {:?}, Seq: {}", row_entry.key(), v, row_entry.seq()); } ValueDeletable::Merge(m) => { println!("Key: {:?}, Merge: {:?}, Seq: {}", row_entry.key(), m, row_entry.seq()); } ValueDeletable::Tombstone => unreachable!(), }}Implementation Details:
-
Add
DbIterator::next_row()Method:- Rename the existing internal
next_key_value()method tonext_row()and make it public, with return typeResult<Option<RowEntry>, crate::Error>. - The public
next()method callsnext_row()internally, extractsKeyValuefrom theRowEntry, and returns it (maintaining backward compatibility). - Since decoding metadata requires decoding the entire row, there is no performance penalty for providing both methods.
- Rename the existing internal
-
RowEntry.valueBehavior:RowEntry.valueremains asValueDeletabletype (no type changes needed).- Both
next()andnext_row()will never return entries withTombstonevalues:- The underlying iterator already filters out tombstones.
- For Merge Operations:
- If merge operator is configured and merges are resolved, returns
ValueDeletable::Valuewith the computed result. - If no base value exists for merges, returns
ValueDeletable::Mergewith the final merge value. - This is consistent with how
KeyValueIterator::next()handles merges (see line 42-44 in iter.rs).
- If merge operator is configured and merges are resolved, returns
next()extracts the value fromValueDeletableand returns it asBytesin theKeyValuetuple, so normal scan usage doesn’t need to match onValueDeletable.
-
Implement
get_row()Usingscan():get_row(key)internally callsscan(key..=key)to get aDbIterator.- Then calls
next_row()on the iterator to get theRowEntry. - This reuses existing scan logic and avoids code duplication.
-
Multi-Version Behavior: Both
next()andnext_row()return the latest visible version of each key. If a key has multiple versions at different sequence numbers, only the version visible at the current time (or specified snapshot sequence number) is returned.
<!— TOC —><a name=“2-modify-putwrite-return-types”></a>
2. Modify Put/Write Return Types
Section titled “2. Modify Put/Write Return Types”> [!WARNING]
> Breaking Change: All existing code using db.put, db.delete, db.merge, and db.write (and their _with_options variants) must be updated to handle the new return type.
WriteHandle Structure:
/// Handle returned from write operations, containing metadata about the write./// This structure is designed to be extensible for future enhancements.pub struct WriteHandle { /// The sequence number of this entry. seq: u64, /// The creation timestamp (if set). create_ts: Option<i64>,}
impl WriteHandle { /// Returns the sequence number assigned to this write operation. pub fn seqnum(&self) -> u64 { self.seq }
/// Returns the creation timestamp assigned to this write operation. pub fn create_ts(&self) -> Option<i64> { self.create_ts }
// Future extensions can be added here, for example: // pub async fn await_durability(&self) -> Result<(), Error> { ... }}
> [!NOTE]> `WriteHandle` does not return `expire_ts` because each operation in a batch can have a different expiration time (or no expiration). Returning a single `expire_ts` for a batch write would be ambiguous or incorrect. Users who need `expire_ts` should query the row using `get_row()` or `next_row()`.Single Key Operations:
// Old interfacepub async fn put<K, V>(&self, key: K, value: V) -> Result<(), crate::Error>pub async fn delete<K>(&self, key: K) -> Result<(), crate::Error>pub async fn merge<K, V>(&self, key: K, value: V) -> Result<(), crate::Error>
// New interface. Returns WriteHandle with the assigned sequence number.pub async fn put<K, V>(&self, key: K, value: V) -> Result<WriteHandle, crate::Error>pub async fn delete<K>(&self, key: K) -> Result<WriteHandle, crate::Error>pub async fn merge<K, V>(&self, key: K, value: V) -> Result<WriteHandle, crate::Error>Batch Operations:
// Old interfacepub async fn write(&self, batch: WriteBatch) -> Result<(), crate::Error>
// New interface. Returns WriteHandle with the batch commit sequence number.pub async fn write(&self, batch: WriteBatch) -> Result<WriteHandle, crate::Error>Transaction Commit Operations:
// Old interfacepub async fn commit(self) -> Result<(), crate::Error>pub async fn commit_with_options(self, options: &WriteOptions) -> Result<(), crate::Error>
// New interface. Returns WriteHandle with the commit sequence number.pub async fn commit(self) -> Result<WriteHandle, crate::Error>pub async fn commit_with_options(self, options: &WriteOptions) -> Result<WriteHandle, crate::Error>Migration Example:
// Single key operation - get sequence number and timestampslet handle = db.put(b"key", b"value").await?;println!("Seq: {}", handle.seqnum());println!("Created at: {:?}", handle.create_ts());
// Batch operationlet mut batch = WriteBatch::new();batch.put(b"key1", b"value1");batch.put(b"key2", b"value2");batch.delete(b"key1");
let handle = db.write(batch).await?;println!("Batch committed at seq: {}", handle.seqnum());// All operations in the batch share this sequence number and timestamps
// If you need the full RowEntry, use get_row with the seqnumlet row = db.get_row(b"key").await?;if let Some(entry) = row { assert_eq!(entry.seq(), handle.seqnum()); assert_eq!(entry.create_ts(), handle.create_ts());}
// Option 2: Ignore return values (if you don't need the metadata)let _ = db.put(b"key", b"value").await?;
let mut batch2 = WriteBatch::new();batch2.put(b"another_key", b"another_value");let _ = db.write(batch2).await?;Implementation Details:
- Modify
DbInner::write_with_optionsto returnWriteHandle(containing the assignedcommit_seqandcreate_ts). put(),delete(), andmerge()return theWriteHandlereceived fromDbInner.- Both
put,delete, andmergeoperations share the same underlying write pipeline. WriteHandlecontains the sequence number and creation timestamp assigned to the write operation.- The
create_tsis assigned at the same time as thecommit_seqduring write batch processing. - Note on Transactions: Within a transaction, the individual write operations (
put,delete,merge) do not returnWriteHandlebecause sequence numbers are not known during transaction execution. However,DbTransaction::commit()andcommit_with_options()do returnWriteHandle, allowing users to access the commit sequence number and timestamps assigned when the transaction is successfully committed.
<!— TOC —><a name=“3-support-query-by-version”></a>
3. Support Query by Version
Section titled “3. Support Query by Version”Add a seqnum field to SnapshotOptions to create snapshots at specific sequence numbers:
pub struct SnapshotOptions { pub seqnum: Option<u64>, // New: create snapshot at specific sequence number}
impl SnapshotOptions { pub fn read_at(self, seq: u64) -> Self { Self { seqnum: Some(seq), ..self } }}
// New snapshot APIpub fn snapshot_with_options(&self, options: SnapshotOptions) -> Result<DbSnapshot, Error>;Usage Example:
// Write and get version numberlet handle = db.put(b"key", b"value1").await?;let seq1 = handle.seqnum();
// Subsequent updatedb.put(b"key", b"value2").await?;
// Create snapshot at seq1let snapshot = db.snapshot_with_options(SnapshotOptions::default().read_at(seq1))?;
// Current valuelet current_value = db.get(b"key").await?;assert_eq!(current_value, Some(Bytes::from("value2")));
// Read old version via snapshotlet old_value = snapshot.get(b"key").await?;assert_eq!(old_value, Some(Bytes::from("value1")));Snapshots created with seqnum can be used with all read operations:
Usage Example:
// Write some datalet handle1 = db.put(b"key1", b"value1_v1").await?;let seq1 = handle1.seqnum();db.put(b"key2", b"value2_v1").await?;
let handle2 = db.put(b"key1", b"value1_v2").await?;let seq2 = handle2.seqnum();db.put(b"key2", b"value2_v2").await?;
// Create snapshot at seq1let snapshot = db.snapshot_with_options(SnapshotOptions::default().read_at(seq1))?;
// Normal scan at historical version via snapshotlet mut iter = snapshot.scan(b"key1"..=b"key2").await?;while let Some((key, value)) = iter.next().await? { // Will return value1_v1 and value2_v1 println!("Key: {:?}, Value: {:?}", key, value);}
// Scan with complete row information at historical versionlet mut iter = snapshot.scan(b"key1"..=b"key2").await?;while let Some(row_entry) = iter.next_row().await? { println!("Key: {:?}, Seq: {}, Value: {:?}", row_entry.key(), row_entry.seq(), row_entry.value());}Implementation Details:
-
Snapshot-Based Versioning: Extend the existing
Snapshotmechanism to support creating snapshots at specific sequence numbers viaSnapshotOptions::seqnum. -
Version Retention and Availability: > [!IMPORTANT] > Understanding Historical Version Availability > > Historical versions are not guaranteed to be available indefinitely. Version retention is currently controlled by: > > 1. Active Snapshots: Historical versions are retained as long as there are open snapshots that reference them. The compactor uses
db_state.recent_snapshot_min_seqto determine the minimum sequence number to retain. > 2. Per-Key Retention: Themin_seqis per-key, not global: > - If keyk1is written once withseq=1and never updated, it will remain available withseq=1indefinitely > - If keyk2is written withseq=2and later overwritten withseq=123456, the old version(k2, seq=2)will eventually be compacted away > - There is no single globalmin_seqthat applies to all keys > 3. Future Enhancement - Time Travel: A planned feature (tracked in #1263) will allow configuring a retention window (e.g., 24 hours) to guarantee version availability within that time period, regardless of snapshot lifecycle. > > Best Practices: > - For short-term version queries (e.g., debugging), usesnapshot_with_options()and keep the snapshot open as needed > - For long-term archival, use a separate backup mechanism rather than relying on SlateDB’s version history > - Once time travel is implemented, configure an appropriate retention window for your use case -
Snapshot Visibility:
- Creating a snapshot with
seqnumalways succeeds, even ifseqnum < min_seqorseqnum > max_seq - Read operations on the snapshot use visibility checks:
- If data has been compacted (
seq < min_seqfor that key), it is not visible - If data has not yet been written (
seq > max_seq), it is not visible - If no data satisfies the visibility condition, queries return empty results (
Noneor empty iterator)
- If data has been compacted (
- Creating a snapshot with
-
API Integration:
snapshot_with_options(SnapshotOptions::default().read_at(seq))creates a snapshot view at the specified sequence number- All read operations on the snapshot (
get,scan,get_row) will see data as of that sequence number - The snapshot can be used with
next()andnext_row()methods on iterators to retrieve key-value pairs or complete row information respectively
<!— TOC —><a name=“impact-analysis”></a>
Impact Analysis
Section titled “Impact Analysis”Breaking Changes:
- New Row Query APIs:
- New APIs introduced:
get_row(),get_row_with_options(),DbIterator::next_row(). DbIterator::next()maintains backward compatibility (still returnsResult<Option<KeyValue>, crate::Error>).- Internal
next_key_value()method renamed tonext_row()and made public, returningRowEntry. RowEntry.valueremains asValueDeletabletype (no breaking changes to the type itself).- Neither
next()nornext_row()returnTombstonevalues (tombstones are filtered out by the underlying iterator).
- New APIs introduced:
- Put/Write Return Type Modification:
- All
put(),put_with_options(),delete(),delete_with_options(),merge(),merge_with_options(),write(), andwrite_with_options()calls must be updated to handle the newResult<WriteHandle, ...>return type. - To access the sequence number, call
.seqnum()on the returnedWriteHandle. - Language bindings (Go/Python) must be updated accordingly. For example, in Go, consider exposing a
WriteHandletype with aSeqnum()method, or alternatively return(uint64, error)directly. Note that forwrite(), all operations in the batch share the same sequence number in the current implementation.
- All
Backward Compatibility:
get_row(),get_row_with_options(), andDbIterator::next_row()are new APIs.- They return the existing
RowEntrytype, which is already public. DbIterator::next()maintains backward compatibility (still returnsKeyValue).snapshot_with_options()is a new API that extends existing snapshot functionality.SnapshotOptions.seqnumallows creating snapshots at specific sequence numbers.- No impact on any storage formats (WAL/SST/Manifest).
Migration Path:
No migration needed for existing code. DbIterator::next() maintains backward compatibility:
// Existing scan code continues to work unchangedlet mut iter = db.scan(..).await?;while let Some((key, value)) = iter.next().await? { // No changes needed}
// When you need complete row information, use next_row()let mut iter = db.scan(..).await?;while let Some(row_entry) = iter.next_row().await? { // Access metadata: row_entry.seq(), row_entry.create_ts(), row_entry.expire_ts() match row_entry.value() { ValueDeletable::Value(v) => { // Use row_entry.key() and v } ValueDeletable::Merge(m) => { // Use row_entry.key() and m } ValueDeletable::Tombstone => unreachable!(), }}
// New row query APIslet row = db.get_row(b"key").await?;<!— TOC —><a name=“testing”></a>
Testing
Section titled “Testing”Unit Tests:
- Row query: Normal cases, expired keys, tombstones, merge operations, sequence number monotonicity.
- Put/Write return values: Verify correctness of sequence numbers.
- Versioned query: Read historical versions, expired key handling, behavior after Compaction.
Integration Tests:
- End-to-end workflow: Combined use of row query and versioned query.
- Versioned query across multiple storage layers.
- Consistency of metadata during concurrent writes.
Performance Testing:
get_row()performance: Since it returns complete row information, it decodes both metadata and value. Performance should be similar to existingget()operations.scan()withnext()returningKeyValue: Performance should remain unchanged (backward compatible).scan()withnext_row()returningRowEntry: Performance should be similar tonext(), as metadata is already decoded during iteration.get_with_optionswith specified snapshot sequence number.
<!— TOC —><a name=“alternatives”></a>
Alternatives
Section titled “Alternatives”1. Dedicated Specialized Interfaces (e.g., get_ttl, get_seqnum, etc.)
- Reason for Rejection:
- API Complexity: This would rapidly expand the API surface area and increase user learning costs. A single generic interface
get_rowthat returns complete row information (including metadata) maintains API simplicity. - Atomicity: Some use cases require retrieving both the key/value and its metadata atomically. Separate methods would prevent this (unless a snapshot is created). Returning the complete
RowEntrynaturally supports atomic access.
- API Complexity: This would rapidly expand the API surface area and increase user learning costs. A single generic interface
2. Introduce put_with_metadata() Method
- Reason for Rejection: Introducing a new API would lead to functional overlap. Modifying the existing
put()return type, while a breaking change, maintains architectural simplicity and consistency.
3. Return Only Metadata (RowMetadata) Instead of Complete Row (RowEntry)
- Reason for Rejection: Since decoding metadata requires decoding the entire row, returning only metadata would not improve performance. Returning the complete
RowEntryprovides more value to users while utilizing an existing public type, reducing API complexity.
4. No Change (Status Quo)
- Problem: Users cannot access physically stored metadata (TTL, sequence number, etc.), limiting important scenarios like debugging, auditing, and MVCC implementations.