Skip to content

Decoupled Pluggable Object Store Cache

Status: Accepted

Authors:

SlateDB stores its data in object storage, so every cache miss is a network call. Caching helps at two layers: a block cache for hot blocks (already pluggable via Foyer) and an object store cache for SST file parts on local storage. The object store cache today ships inside the SlateDB crate as CachedObjectStore. When enabled, it always admits fetched bytes to the local cache on a read miss, and a single cache_puts: bool knob toggles whether PUTs are also written through to the local cache. There is no per-object policy: WAL writes, manifest writes, compaction outputs, and memtable flushes are treated the same way.

This RFC proposes two changes:

  1. Intent protocol. Every ObjectStore call SlateDB issues will carry a typed intent describing the caller’s purpose: write kind (WAL, memtable flush, compaction output, manifest), read kind (foreground, compaction input, warmup), and an optional retry reason (CRC mismatch, block decode error, decompression error). A cache wrapper reads the intent and applies its own policy.
  2. Pluggable cache placement. CachedObjectStore will no longer be explicitly created inside SlateDB. The user passes the innermost backend to Db::builder and, optionally, a cache as an ObjectStoreWrapper: the bundled CachedObjectStore exposed as a wrapper, or any other intent-aware ObjectStore. The builder wraps the backend with instrumentation and retry, the same order it uses today, then applies the cache wrapper on top.

The protocol will be dependent on object_store::Extensions. An alternative is SlateDB-owned SlateDbObjectStore trait was considered and is described in Alternatives; this RFC proposes the Extensions route.

The block cache stays as it is. This RFC is about the object store cache.

SlateDB has two places to cache:

Object store cache. CachedObjectStore splits each SST into fixed-size part files (default 4 MB) and writes one file per part on local disk. A read for a byte range fetches whichever part files cover that range; missing part files are fetched from upstream and admitted. The cached bytes are whatever the backend returned.

Block cache. A Foyer cache stores blocks, filters, indexes and stats either in memory or tiered across memory and disk. Each cached value has a typed in-memory form (Arc<Block>, Arc<SsTableIndexOwned>, Arc<[NamedFilter]>, Arc<SstStats>) and a corresponding encode/decode pair for the disk tier. The memory tier holds the typed struct directly; the disk tier holds its encoded form (post-decompression and post-CRC) so hits in either tier skip the decompression and CRC work the object store cache pays on every read. A memory tier hit returns the existing Arc<T> with no CPU work; a disk tier hit pays one serde deserialize and one type-specific decode pass to reconstruct the struct. Foyer owns admission and eviction. This RFC does not change the block cache.

They compare as follows:

DimensionObject store cacheBlock cache (Foyer)
Cached unitPart file (default 4 MB) of an SSTOne block (~4 KiB), filter, index, or stat (typed Arc<T> in the memory tier, type-specific encoded form in the disk tier)
CPU on hitCRC check, offsets parse, decompression if compression_codec is set, plus any configured block transformerMemory tier: refcount bump. Disk tier: serde framing + one type-specific decode pass (the cached form is already decompressed and CRC, verified)
Per-SST bookkeepingOne file per part (e.g. 64 part files for a 256 MB SST at the 4 MB default)Thousands of entries
Memory footprintFile metadata onlyIn-memory index grows with cached items (can reach GBs)
Restart costZero, open files on demandRebuild the in-memory index from disk

Both caches can run together; they sit at different levels of the read path. The right configuration depends on the workload (working-set size, memory budget, read amplification, tolerance for cold-read latency). There are many valid setups: object store cache only, Foyer only, Foyer scoped to a subset of content types (e.g. filters/indexes/stats) layered over an object store cache for data part files, full Foyer over an object store cache, and so on. This RFC does not prescribe one.

The object store cache lives inside the SlateDB crate. Several problems follow:

  1. SlateDB owns policy it should not own, and the policy is coarse. The current cache exposes cache_puts: bool (write-through on or off) and treats every object identically. It cannot distinguish a short-lived WAL write from a long-lived compaction output. Refining that policy, or adding any new signal, means editing core.
  2. Two caches, two different pluggability levels. The block cache is already pluggable. The object store cache is not.
  3. Cache invalidation forces readers to hold the cache. Dropping fsync on disk cache writes (#1571) required SST readers to evict the stale entry themselves when a CRC mismatch is observed. That works, but it means the reader paths now carry a typed Arc<CachedObjectStore> next to their Arc<dyn ObjectStore> and call cache-specific methods on it. Every new component that needs similar behavior pays the same plumbing cost.
  4. Compaction is not aware of the object store cache. Compaction reads and writes go straight to the raw object store and never see the cache wrapper. To change that (e.g. to skip admitting compaction outputs, or to let compaction inputs bypass the cache), every compactor-facing component would need both an Arc<dyn ObjectStore> and an Arc<CachedObjectStore>, plus a new set of user-facing options to govern which one each call uses. Same shape as problem (3), and the same shape will reappear for any future component that wants per-call cache behavior.
  5. Cached ObjectStore is not DST friendly. CachedObjectStore directly uses spawn_blocking so it can’t be tested it in DST.

These all point at the same fix. A generic layer that abstracts the cache behind a uniform call protocol means every component talks to one ObjectStore and tags each call with its intent; the cache (or whatever wrapper the user installed) acts on the tag. No component needs to know whether a cache is present, what kind it is, or what its current policy is, and no user-facing config has to grow per-component cache toggles.

The object_store crate already provides a suitable mechanism. Its GetOptions, PutOptions, and PutMultipartOptions types each carry an extensions field. SlateDB can attach metadata to it, and a wrapper can read it back. Backends that do not know about the metadata (S3, GCS, Azure) ignore it.

  • Define a typed protocol so a cache wrapper knows what each ObjectStore call is for.
  • Allow the user to plug in their own ObjectStore (cache or otherwise) without SlateDB knowing it is there.
  • Keep the bundled CachedObjectStore available as one such wrapper, with admission policy driven by intent.
  • Defining an on-disk format for any specific cache.
  • Picking a single canonical admission or eviction policy. That is the wrapper’s job.
  • Building automatic warmup or eviction. Warmup stays caller-driven via the existing DbCacheManagerOps::warm_sst API.
  • Splitting the object store cache into its own crate. Discussed in the Open Questions.

The protocol is a small set of types. SlateDB will insert one of them into extensions on every read or write:

pub struct WriteIntent { pub kind: WriteKind }
pub enum WriteKind {
Flush,
CompactionOutput,
Manifest,
Wal,
}
pub struct ReadIntent {
pub kind: ReadKind,
pub retry: Option<RetryReason>,
}
pub enum ReadKind {
Foreground,
CompactionInput,
Warmup,
}
pub enum RetryReason {
CrcMismatch,
BlockDecodeError,
DecompressionError,
}

The protocol covers three things the wrapper needs to know:

  1. What kind of write is this? (WriteIntent) Lets the wrapper decide admission per kind. A typical policy keeps data-bearing writes (Flush) and skips short-lived or tiny ones (Wal, Manifest) and bulk ones (CompactionOutput), but the policy is the wrapper’s choice.
  2. What kind of read is this? (ReadIntent) Lets the wrapper apply different policies to foreground queries, compaction-input scans, and warmups.
  3. Is this a retry after a bad cache hit? (ReadIntent.retry) SlateDB will set this when an SST decode fails (CRC, block decode, decompression). The wrapper decides how to respond; a common action is to evict the local cache entry, fetch fresh bytes from upstream, and re-cache. Without this signal, a wrapper that keeps serving the same corrupt bytes on every retry would trap the caller in an infinite retry loop.

This RFC proposes carrying intent via object_store::Extensions. SlateDB inserts a typed intent (WriteIntent or ReadIntent) into the extensions field on every GetOptions, PutOptions, and PutMultipartOptions it builds; a wrapper reads it with extensions.get::<ReadIntent>() or get::<WriteIntent>(). No new public trait, and the protocol composes directly with any generic ObjectStore wrapper at any level of the chain.

Cost: a small per-call heap allocation overhead, described in Allocation overhead.

Two alternative interfaces were considered and rejected for this RFC: a SlateDB-owned trait that wraps an inner ObjectStore, and a SlateDB-owned trait that replaces the object_store dependency entirely. Both are described in Alternatives and remain viable future directions if allocation pressure, ecosystem coupling, or extensibility needs grow.

One workaround is needed to make this current choice valid: upstream object_store::buffered::BufWriter drops extensions on the single-PUT shutdown path. See Caveats and we need to fix it upstream.

SlateDB will support tagging every call to ObjectStore as much as possible:

  1. Tag every write with WriteIntent. Every opts-capable write (put_opts and multipart upload init) will set a WriteIntent. WAL writers use Wal. Memtable flush outputs use Flush. Compaction outputs use CompactionOutput. Manifest and compactions-state writes use Manifest. Aside from documented upstream/API gaps, opts-capable writes from SlateDB core are tagged.
  2. Tag every read with ReadIntent. Every opts-capable get_opts will set a ReadIntent. Foreground user queries use Foreground. Compaction inputs use CompactionInput. The DbCacheManagerOps::warm_sst path uses Warmup. Aside from documented upstream/API gaps, opts-capable reads from SlateDB core are tagged.
  3. Retry on validation failure with a retry tag. If decoding an SST fails with a recoverable error (CRC mismatch, bad block, bad decompression), SlateDB will reissue the read with retry = Some(reason). SlateDB may retry up to a bounded number of times before surfacing the failure. The retry tag tells the wrapper to drop the cached file rather than return the same corrupt bytes again.

Untagged calls flow through the wrapper unchanged and the wrapper applies its default behavior. The bundled CachedObjectStore treats untagged writes according to its existing cache_puts policy, preserving pre-protocol behavior for known gaps such as object_store::buffered::BufWriter.

Three ObjectStore methods cannot carry an intent because object_store 0.13 has no *_opts variant for them: delete, list, and the convenience head(path). They still flow through whatever wrapper the user installed, so a cache wrapper has everything it needs without a separate hook:

  • delete(path) is enough signal to evict the local cache entry. Garbage collection already calls ObjectStore::delete, so cache eviction on delete happens for free.
  • list(prefix) is enough for enumeration.
  • head(path) (convenience form) flows through and can be served from cached metadata. Paths that need intent on a head request can use get_opts with head set to true instead.

SlateDB talks to object storage through Arc<dyn ObjectStore>. The user passes the innermost backend to Db::builder(path, backend). That backend is a raw store (S3, GCS, Azure, InMemory, LocalFileSystem).

A cache (or any other wrapper) is supplied separately as an ObjectStoreWrapper through .with_object_store_wrapper(..). SlateDB keeps its instrumentation and retry on the backend, exactly as today, and applies the wrapper on top of them. The wrapper can be:

  • The bundled CachedObjectStore, exposed as a wrapper.
  • Any other intent-aware wrapper the user wrote.

If the user supplies no wrapper, the protocol is a no-op for caching purposes: SlateDB still attaches intents on every call, but nothing reads them.

If the user supplies a wrapper, every read and write flows through it with a typed intent on the extensions.

This keeps the physical layering identical to what SlateDB ships today. The only change is that the cache is constructed from a user-supplied wrapper rather than from Settings. See Builder layering for the exact order and The cache as a wrapper for how the bundled and custom caches both plug in.

DbBuilder, GarbageCollectorBuilder, CompactorBuilder, and DbReaderBuilder will assemble the main object store the same way, from innermost to outermost:

1. The backend the user passed (a raw store: S3, GCS, Azure,
InMemory, LocalFileSystem)
|
v
2. InstrumentedObjectStore (per-operation metrics)
|
v
3. RetryingObjectStore (transient failure retries)
|
v
4. The cache wrapper the user supplied, if any (the bundled
CachedObjectStore or a custom intent-aware wrapper)
|
v
SlateDB core (TableStore, ManifestStore, CompactionsStore, GC)

Layers 2 and 3 are SlateDB-managed and sit directly on the backend, the same order SlateDB ships today. Layer 4 is whatever cache wrapper the user supplied, applied on top. SlateDB core will not construct a cache on the user’s behalf; the bundled CachedObjectStore is one of several wrappers the user may supply.

Instrumentation and retry stay on the backend for two reasons. The metrics on layer 2 count only requests that reach the backend, so a cache hit served by layer 4 is not counted and the numbers continue to measure object store API traffic. Retry on layer 3 handles only backend errors with the policy designed for them, and the cache wrapper keeps full control of its own internal error handling.

Users who want the bundled cache supply it as a wrapper with CachedObjectStore::builder:

let cache = CachedObjectStore::builder()
.root_folder(dir)
.build();
let db = Db::builder(path, raw_backend)
.with_object_store_wrapper(cache)
.build()
.await?;

root_folder is the only required setting. Other settings (cache_puts, part_size_bytes, max_cache_size_bytes, scan_interval, max_open_file_handles) have defaults and are set via builder methods when needed. SlateDB internals (metrics_recorder, system_clock, rand) are defaulted as well and only overridden for tests or to share state with a specific Db instance.

ObjectStoreCacheOptions will live in slatedb::cached_object_store, not in Settings. SlateDB core will have no field that describes the cache; the protocol on each call is the only contract it knows.

This is the same physical order SlateDB ships today: instrumentation and retry wrap the backend, and the cache wraps them. What changes is only how the cache is constructed. Today SlateDB reads Settings::object_store_cache_options and builds CachedObjectStore itself. Under this proposal the user supplies the cache as a wrapper and SlateDB applies it at step 4.

Because retry sits below the cache wrapper, the cache keeps full control of its own internal error handling. A transient disk I/O failure inside the cache, a corrupt local part, or a missing cache file are the cache’s to resolve (fall back to upstream, evict and refetch, or surface the error), and SlateDB does not apply a backend retry policy to them. The cache’s own calls to its inner store still get backend retries, because that inner store is the retrying store, so a custom wrapper never implements retry for backend calls itself.

Two consequences worth calling out:

  • Compactor and GC builders will only add instrumentation and retry on the backend; they will not apply a cache wrapper. Users who want compactor reads or GC operations to go through their own cache supply the wrapper to those builders.
  • Manifest writes and compactions metadata writes will flow through the cache wrapper (because there is only one main store handle). They will still skip the cache because they carry WriteIntent::manifest() and the bundled wrapper short-circuits that intent. See Cache wrapper behavior.

A cache plugs in as an ObjectStoreWrapper. SlateDB owns the seam: it builds the instrumented, retrying backend first, then calls the wrapper to wrap it. This is what keeps instrumentation and retry on the backend while leaving the cache pluggable.

/// A wrapper SlateDB applies on top of the instrumented, retrying backend.
///
/// SlateDB instruments and wraps the user's backend with retry first, then
/// calls `wrap`, so the returned store sits above retry and below SlateDB core.
#[async_trait]
pub trait ObjectStoreWrapper: Send + Sync {
async fn wrap(
&self,
inner: Arc<dyn ObjectStore>,
) -> Result<Arc<dyn ObjectStore>, SlateDBError>;
}

Each builder assembles the main store the same way. The instrument and retry unit is unchanged from today; the only new step is the optional wrapper on top:

// Unchanged: Instrument then Retry over the user's backend, with the
// per-builder component and store_type labels.
let base = instrumented_retrying_object_store(
backend, &recorder, component, store_type, rand.clone(), system_clock.clone(),
);
// The cache wrapper, if supplied, wraps the instrumented, retrying backend.
let main_store = match &self.object_store_wrapper {
Some(wrapper) => wrapper.wrap(base).await?,
None => base,
};

The bundled CachedObjectStore. CachedObjectStore is now the wrapper the user constructs: it holds the cache configuration and implements ObjectStoreWrapper. The live cache ObjectStore that does the part-file I/O becomes a private inner type (CachedObjectStoreInner below); wrap builds it from the configuration and the inner store. The async setup that CachedObjectStore::from_config does today moves onto that inner type:

impl CachedObjectStore {
/// Start building the bundled cache wrapper. Holds config only, no I/O yet.
pub fn builder() -> CachedObjectStoreBuilder { /* ... */ }
}
#[async_trait]
impl ObjectStoreWrapper for CachedObjectStore {
async fn wrap(
&self,
inner: Arc<dyn ObjectStore>,
) -> Result<Arc<dyn ObjectStore>, SlateDBError> {
// `inner` is the instrumented, retrying backend for this component.
// `CachedObjectStoreInner` is the private live cache store.
let cache = CachedObjectStoreInner::from_config(inner, &self.options, /* ... */).await?;
Ok(cache)
}
}

In every case the backend, instrumentation, and retry are assembled by SlateDB in the same order it uses today. The cache is the one wrapper that moves out of SlateDB and into a user-supplied ObjectStoreWrapper.

The bundled CachedObjectStore initially keeps its current behavior:

  • Reads always admit fetched bytes to the local cache on a miss.
  • Writes follow the existing cache_puts: bool master switch (off by default; when on, every PUT is written through).
  • Multipart puts pass through (never cached), as today.

Adopting the intent protocol does not change any of that on day one. What it does change is what the wrapper can do without further plumbing into SlateDB core. Each of these evolutions is now expressible as a small, isolated wrapper change, and each can land as its own follow-up:

intentWrapper change
WriteIntent::wal() on PUT with cache_putsSkip the local write; forward to upstream only.
WriteIntent::manifest() on PUT with cache_putsSkip the local write; forward to upstream only.
WriteIntent::compaction_output() on PUTAdd option to cache on compaction
ReadIntent::compaction_input() on GETAdd option to skip cache lookup and skip miss-admit; forward to upstream.
ReadIntent.retry = Some(_) on GETDelete the cache entry for path, then run the usual cache-aware fetch. Closes problem 3; lets the fsync workaround be removed.

Each future change is local to the wrapper. SlateDB core’s contract (tag every call with intent) does not move.

Concrete policy choices (admission per kind, part sizes, prefetch shape, on-disk layout, eviction strategy) are owned by the wrapper. Users who want different choices implement their own ObjectStore and pass it to Db::builder.

Simplest form with the bundled cache:

let backend: Arc<dyn ObjectStore> = Arc::new(InMemory::new());
let cache = CachedObjectStore::builder()
.root_folder(cache_dir)
.build();
let db = Db::builder("my-db", backend)
.with_object_store_wrapper(cache)
.build()
.await?;

With more options set:

let cache = CachedObjectStore::builder()
.root_folder(cache_dir)
.cache_puts(true)
.part_size_bytes(8 * 1024 * 1024)
.max_cache_size_bytes(32 * 1024 * 1024 * 1024)
.build();

For tests or advanced wiring (shared metrics recorder, mock clock, deterministic RNG):

let cache = CachedObjectStore::builder()
.root_folder(cache_dir)
.metrics_recorder(shared_recorder)
.system_clock(mock_clock)
.rand(seeded_rand)
.build();

Tagging a single call pays three small heap allocations: the Box<HashMap> for the lazily-initialized Extensions map, the HashMap’s bucket array on first insert, and the Box<val> for the intent itself. None of the three can be reused across calls because ObjectStore::get_opts and friends take GetOptions by value, and Extensions::clone() is a deep clone of all three pieces.

This does not affect read latency in any meaningful way next to a network or local-disk round trip. On hot paths (compaction reading thousands of blocks, warmup loops issuing many GETs) the aggregate allocator pressure is non-zero but well below the cost of the I/O itself.

A future optimization, if allocations come to matter, is to switch to SlateDbObjectStore (zero allocation overhead, see Alternatives) or to a static-intent-per-handle arrangement if object_store ever changes get_opts to take &GetOptions.

  • Upstream BufWriter extensions bug. SST writes go through object_store::buffered::BufWriter with extensions attached via .with_extensions(..). BufWriter::poll_shutdown drops extensions on the single-PUT path (payload fits in capacity); the multipart overflow path forwards them. Tracked as apache/arrow-rs-object-store#735. Until that fix lands, SST writes below the default 10 MB BufWriter capacity will reach a wrapper untagged. Two matrix tests will document the bug (bufwriter_small_payload_drops_write_intent, bufwriter_large_payload_carries_write_intent) and act as regression catches for when the upstream fix lands.

    Practical effect: in production, small SST writes (memtable flushes that fit in 10 MB, occasional small compaction outputs) may reach the wrapper untagged. Since the bundled CachedObjectStore initially keeps its current policy, those writes follow the pre-protocol behavior. WAL and Manifest writes do not go through BufWriter and so will always carry their intent correctly.

  • Multipart-part-level intent on the wire. WriteIntent attaches at put_multipart_opts init, not on individual put_part calls. A cache wrapper that wraps MultipartUpload (which it must do anyway to observe part bytes) holds the intent in its own state and applies it to every part. Wrappers that do not wrap MultipartUpload see only the init call. The bundled cache will not admit multipart uploads, so this gap does not affect it.

  • Basic KV API (get/put/delete)
  • Range queries, iterators, seek semantics
  • Range deletions
  • Error model, API errors

Consistency, Isolation, and Multi-Versioning

Section titled “Consistency, Isolation, and Multi-Versioning”
  • Transactions
  • Snapshots
  • Sequence numbers
  • Time to live (TTL)
  • Compaction filters
  • Merge operator
  • Change Data Capture (CDC)
  • Manifest format: unchanged. Manifest writes will carry WriteIntent::manifest() via the slatedb-txn-obj constructor.
  • Checkpoints
  • Clones
  • Garbage collection: GC deletes will flow through the wrapper, triggering cache eviction as a side effect. GC will use the same consolidated main object store handle.
  • Database splitting and merging
  • Multi-writer
  • Compaction state persistence: writes will carry WriteIntent::manifest() via the compactions_store constructor.
  • Compaction filters
  • Compaction strategies
  • Distributed compaction
  • Compactions format: compaction-output SST writes will carry WriteIntent::compaction_output(); compaction-input reads will carry ReadIntent::compaction_input().
  • Write-ahead log (WAL): WAL writes will carry WriteIntent::wal() so wrappers can skip admitting them.
  • Block cache: unchanged; the in-process block cache continues to operate independently.
  • Object store cache: today’s CachedObjectStore will be decoupled from core construction and made user-constructible. Its pre-existing cache_puts: bool master switch and admission behavior are preserved initially.
  • Indexing (bloom filters, metadata)
  • SST format or block format: the decode path will issue retry-tagged reads on recoverable validation failures.
  • CLI tools
  • Language bindings (Go/Python/etc)
  • Observability (metrics/logging/tracing): no new SlateDB metrics. The bundled cache wrapper will continue to emit its existing stats; wrappers own their own observability.

Initially, the measurable runtime change is per-call tagging overhead:

  • CPU and memory pressure. Tagging adds a small per-call heap allocation overhead. Latency against network or local-disk I/O is unaffected, and aggregate allocator traffic on compaction or warmup stays well below the I/O cost. See Allocation overhead.

Latency, throughput, and object-store cost do not otherwise change initially because the bundled wrapper keeps its current behavior (see Cache wrapper behavior). They can evolve as the follow-ups in that section land:

  • Skipping WAL / manifest / compaction-output write-through reduces local disk writes when cache_puts: true and leaves more room for hot data.
  • Bypassing compaction-input reads keeps one-shot bulk reads out of the cache, trading cache-hit savings on compactor reads for a smaller working set on foreground reads.
  • Evicting on retry-tagged reads lets the fsync workaround go away, removing one local disk sync per admitted write at the cost of one extra upstream GET on the rare CRC-mismatch path.
  • SlateDB will add no new metrics for the protocol itself.
  • The bundled CachedObjectStore retains its existing hit-rate, eviction, and admission stats.
  • User-supplied wrappers own their own metrics.
  • API. Users who set Settings::object_store_cache_options.root_folder to enable the bundled cache will need to construct a CachedObjectStore::builder() and supply it through Db::builder(..).with_object_store_wrapper(..) (the auto-construct convenience is removed when the field moves out of Settings). The bundled cache’s initial admission behavior, including cache_puts, is preserved.
  • On-disk format. Unchanged. The disk format used by CachedObjectStore stays the same, so cached files survive the migration.
  • Wire format. Unchanged.
  • Wrappers and backends that ignore the protocol. Unaffected.

The work spans two crates: slatedb-txn-obj and slatedb. slatedb depends on slatedb-txn-obj, so the Extensions support on the slatedb-txn-obj protocol and boundary objects lands first; the SlateDB-side changes that use it follow.

Keep the cache in SlateDB. Works today. Locks policy into SlateDB, makes it hard to ship alternative caches, and forces every policy change through a SlateDB release. Rejected.

A separate SlateDbObjectStore trait that wraps an inner ObjectStore. Define SlateDB’s own storage interface as a distinct trait whose methods take intents as regular arguments, but keep object_store as the backend layer the user constructs:

pub trait SlateDbObjectStore {
async fn get(&self, path: &Path, range: Range<usize>, intent: ReadIntent) -> Result<Bytes>;
async fn put(&self, path: &Path, payload: PutPayload, intent: WriteIntent) -> Result<()>;
async fn delete(&self, path: &Path) -> Result<()>;
// ...
}

Zero allocation overhead and SlateDB-owned. SlateDB still depends on object_store and the user still constructs an ObjectStore backend to hand in (wrapped by a default SlateDbObjectStore adapter, or by a custom one). Trade-off: a new public trait that SlateDB must version, and generic intermediate wrappers (rate limiter, encryption) must either sit below the cache or be rewritten as SlateDbObjectStore-aware versions.

Not the proposed approach. The Extensions allocation cost is dwarfed by the I/O on every realistic call. If that changes, the migration is mechanical: replace extensions.insert(intent) with a typed argument and extensions.get::<T>() with a method parameter.

Intent-aware cache policy as a thin router above CachedObjectStore. Keep CachedObjectStore as the physical cache, add a layer above that decides per-intent whether to use the cache, look up only, bypass, evict and refetch, or write-through; the policy trait would be the public surface.

Not the proposed approach. Reasons: the bundled CachedObjectStore can do the routing internally based on intent, the proposed policy trait would still require typing the routing decisions, and allowing users to bring a fully custom cache implementation (different on-disk format, io_uring, sharding, O_DIRECT) is more flexible than letting them substitute only the policy.

A SlateDB-owned trait that does not wrap ObjectStore at all. A bigger version of the previous alternative. SlateDB defines its own storage trait (SlateDbObjectStore) and ships built-in adapters (e.g. for object_store and opendal). The object_store crate is no longer a public dependency: users pick from SlateDB’s adapters or implement the trait themselves against a raw SDK. Zero allocation overhead, full control over the storage contract (batch ops, richer cache hints, per-call deadlines, etc.), and freedom to evolve without coordinating with upstream.

Cost: SlateDB owns the adapter surface and any future backend shape changes, and users who want a backend not yet adapted have to write the adapter themselves. The biggest structural change of the three trait options, and the biggest ownership commitment. Not the proposed approach, kept on the table as a future direction if ecosystem coupling or extensibility needs grow.

Defer to OpenDAL’s cache layer. OpenDAL recently merged a cache layer RFC (apache/opendal#6297) that supports chunk caching. With the protocol in this RFC, nothing prevents a user from wiring an OpenDAL-backed cache as their ObjectStore. If OpenDAL’s cache layer reads typed cache hints from Extensions or grows an intent-equivalent argument, this RFC’s protocol bridges to it directly. SlateDB does not need to choose; users do.

Static intent per ObjectStore handle (requires upstream API change). Construct multiple internal handles each preconfigured with a fixed intent, sharing pre-built GetOptions / PutOptions across many calls. Only valid if ObjectStore::get_opts takes &GetOptions by reference instead of by value. The upstream change is on the object_store crate; if it ever lands, this becomes the cleanest allocation-free path that keeps the Extensions protocol.

  • Should the object store cache live in its own crate (e.g. slatedb-obj-cache)? Not proposed here. With the protocol in place and the cache no longer required for correctness, but it seems like the right step to do after the RFC is executed.
  • Allocation cost in practice. Not measured. Worth profiling on compaction and warmup paths before committing to further optimization (or before switching to SlateDbObjectStore).