Decoupled Pluggable Object Store Cache
Status: Accepted
Authors:
Summary
Section titled “Summary”SlateDB stores its data in object storage, so every cache miss is a network
call. Caching helps at two layers: a block cache for hot blocks (already
pluggable via Foyer) and an object store cache for SST file parts on local
storage. The object store cache today ships inside the SlateDB crate as
CachedObjectStore. When enabled, it always admits fetched bytes to the local
cache on a read miss, and a single cache_puts: bool knob toggles whether
PUTs are also written through to the local cache. There is no per-object
policy: WAL writes, manifest writes, compaction outputs, and memtable flushes
are treated the same way.
This RFC proposes two changes:
- Intent protocol. Every
ObjectStorecall SlateDB issues will carry a typed intent describing the caller’s purpose: write kind (WAL, memtable flush, compaction output, manifest), read kind (foreground, compaction input, warmup), and an optional retry reason (CRC mismatch, block decode error, decompression error). A cache wrapper reads the intent and applies its own policy. - Pluggable cache placement.
CachedObjectStorewill no longer be explicitly created inside SlateDB. The user passes the innermost backend toDb::builderand, optionally, a cache as anObjectStoreWrapper: the bundledCachedObjectStoreexposed as a wrapper, or any other intent-awareObjectStore. The builder wraps the backend with instrumentation and retry, the same order it uses today, then applies the cache wrapper on top.
The protocol will be dependent on object_store::Extensions. An alternative is
SlateDB-owned SlateDbObjectStore trait was considered and is described in
Alternatives; this RFC proposes the Extensions route.
The block cache stays as it is. This RFC is about the object store cache.
Motivation
Section titled “Motivation”Two layers of caching
Section titled “Two layers of caching”SlateDB has two places to cache:
Object store cache. CachedObjectStore splits each SST into fixed-size
part files (default 4 MB) and writes one file per part on local disk. A read
for a byte range fetches whichever part files cover that range; missing part
files are fetched from upstream and admitted. The cached bytes are whatever the
backend returned.
Block cache. A Foyer cache stores blocks, filters, indexes and stats either
in memory or tiered across memory and disk. Each cached value has a typed
in-memory form (Arc<Block>, Arc<SsTableIndexOwned>, Arc<[NamedFilter]>,
Arc<SstStats>) and a corresponding encode/decode pair for the disk tier. The
memory tier holds the typed struct directly; the disk tier holds its encoded
form (post-decompression and post-CRC) so hits in either tier skip the
decompression and CRC work the object store cache pays on every read. A memory
tier hit returns the existing Arc<T> with no CPU work; a disk tier hit pays
one serde deserialize and one type-specific decode pass to reconstruct the
struct. Foyer owns admission and eviction. This RFC does not change the block
cache.
They compare as follows:
| Dimension | Object store cache | Block cache (Foyer) |
|---|---|---|
| Cached unit | Part file (default 4 MB) of an SST | One block (~4 KiB), filter, index, or stat (typed Arc<T> in the memory tier, type-specific encoded form in the disk tier) |
| CPU on hit | CRC check, offsets parse, decompression if compression_codec is set, plus any configured block transformer | Memory tier: refcount bump. Disk tier: serde framing + one type-specific decode pass (the cached form is already decompressed and CRC, verified) |
| Per-SST bookkeeping | One file per part (e.g. 64 part files for a 256 MB SST at the 4 MB default) | Thousands of entries |
| Memory footprint | File metadata only | In-memory index grows with cached items (can reach GBs) |
| Restart cost | Zero, open files on demand | Rebuild the in-memory index from disk |
Mixing the two layers
Section titled “Mixing the two layers”Both caches can run together; they sit at different levels of the read path. The right configuration depends on the workload (working-set size, memory budget, read amplification, tolerance for cold-read latency). There are many valid setups: object store cache only, Foyer only, Foyer scoped to a subset of content types (e.g. filters/indexes/stats) layered over an object store cache for data part files, full Foyer over an object store cache, and so on. This RFC does not prescribe one.
The problem today
Section titled “The problem today”The object store cache lives inside the SlateDB crate. Several problems follow:
- SlateDB owns policy it should not own, and the policy is coarse. The
current cache exposes
cache_puts: bool(write-through on or off) and treats every object identically. It cannot distinguish a short-lived WAL write from a long-lived compaction output. Refining that policy, or adding any new signal, means editing core. - Two caches, two different pluggability levels. The block cache is already pluggable. The object store cache is not.
- Cache invalidation forces readers to hold the cache. Dropping fsync on
disk cache writes (#1571)
required SST readers to evict the stale entry themselves when a CRC
mismatch is observed. That works, but it means the reader paths now
carry a typed
Arc<CachedObjectStore>next to theirArc<dyn ObjectStore>and call cache-specific methods on it. Every new component that needs similar behavior pays the same plumbing cost. - Compaction is not aware of the object store cache. Compaction reads
and writes go straight to the raw object store and never see the cache
wrapper. To change that (e.g. to skip admitting compaction outputs, or
to let compaction inputs bypass the cache), every compactor-facing
component would need both an
Arc<dyn ObjectStore>and anArc<CachedObjectStore>, plus a new set of user-facing options to govern which one each call uses. Same shape as problem (3), and the same shape will reappear for any future component that wants per-call cache behavior. - Cached ObjectStore is not DST friendly.
CachedObjectStoredirectly usesspawn_blockingso it can’t be tested it in DST.
These all point at the same fix. A generic layer that abstracts the cache
behind a uniform call protocol means every component talks to one
ObjectStore and tags each call with its intent; the cache (or whatever
wrapper the user installed) acts on the tag. No component needs to know
whether a cache is present, what kind it is, or what its current policy is,
and no user-facing config has to grow per-component cache toggles.
The object_store crate already provides a suitable mechanism. Its
GetOptions, PutOptions, and PutMultipartOptions types each carry an
extensions field. SlateDB can attach metadata to it, and a wrapper can read it
back. Backends that do not know about the metadata (S3, GCS, Azure) ignore it.
- Define a typed protocol so a cache wrapper knows what each
ObjectStorecall is for. - Allow the user to plug in their own
ObjectStore(cache or otherwise) without SlateDB knowing it is there. - Keep the bundled
CachedObjectStoreavailable as one such wrapper, with admission policy driven by intent.
Non-Goals
Section titled “Non-Goals”- Defining an on-disk format for any specific cache.
- Picking a single canonical admission or eviction policy. That is the wrapper’s job.
- Building automatic warmup or eviction. Warmup stays caller-driven via the
existing
DbCacheManagerOps::warm_sstAPI. - Splitting the object store cache into its own crate. Discussed in the Open Questions.
Design
Section titled “Design”The intent protocol
Section titled “The intent protocol”The protocol is a small set of types. SlateDB will insert one of them into
extensions on every read or write:
pub struct WriteIntent { pub kind: WriteKind }
pub enum WriteKind { Flush, CompactionOutput, Manifest, Wal,}
pub struct ReadIntent { pub kind: ReadKind, pub retry: Option<RetryReason>,}
pub enum ReadKind { Foreground, CompactionInput, Warmup,}
pub enum RetryReason { CrcMismatch, BlockDecodeError, DecompressionError,}The protocol covers three things the wrapper needs to know:
- What kind of write is this? (
WriteIntent) Lets the wrapper decide admission per kind. A typical policy keeps data-bearing writes (Flush) and skips short-lived or tiny ones (Wal,Manifest) and bulk ones (CompactionOutput), but the policy is the wrapper’s choice. - What kind of read is this? (
ReadIntent) Lets the wrapper apply different policies to foreground queries, compaction-input scans, and warmups. - Is this a retry after a bad cache hit? (
ReadIntent.retry) SlateDB will set this when an SST decode fails (CRC, block decode, decompression). The wrapper decides how to respond; a common action is to evict the local cache entry, fetch fresh bytes from upstream, and re-cache. Without this signal, a wrapper that keeps serving the same corrupt bytes on every retry would trap the caller in an infinite retry loop.
Choice of interface
Section titled “Choice of interface”This RFC proposes carrying intent via object_store::Extensions. SlateDB
inserts a typed intent (WriteIntent or ReadIntent) into the extensions
field on every GetOptions, PutOptions, and PutMultipartOptions it builds;
a wrapper reads it with extensions.get::<ReadIntent>() or
get::<WriteIntent>(). No new public trait, and the protocol composes directly
with any generic ObjectStore wrapper at any level of the chain.
Cost: a small per-call heap allocation overhead, described in Allocation overhead.
Two alternative interfaces were considered and rejected for this RFC:
a SlateDB-owned trait that wraps an inner ObjectStore, and a SlateDB-owned
trait that replaces the object_store dependency entirely. Both are described
in Alternatives and remain viable future directions if
allocation pressure, ecosystem coupling, or extensibility needs grow.
One workaround is needed to make this current choice valid: upstream
object_store::buffered::BufWriter drops extensions on the single-PUT shutdown
path. See Caveats and we need to fix it upstream.
SlateDB core: protocol contract
Section titled “SlateDB core: protocol contract”SlateDB will support tagging every call to ObjectStore as much as possible:
- Tag every write with
WriteIntent. Every opts-capable write (put_optsand multipart upload init) will set aWriteIntent. WAL writers useWal. Memtable flush outputs useFlush. Compaction outputs useCompactionOutput. Manifest and compactions-state writes useManifest. Aside from documented upstream/API gaps, opts-capable writes from SlateDB core are tagged. - Tag every read with
ReadIntent. Every opts-capableget_optswill set aReadIntent. Foreground user queries useForeground. Compaction inputs useCompactionInput. TheDbCacheManagerOps::warm_sstpath usesWarmup. Aside from documented upstream/API gaps, opts-capable reads from SlateDB core are tagged. - Retry on validation failure with a retry tag. If decoding an SST fails
with a recoverable error (CRC mismatch, bad block, bad decompression),
SlateDB will reissue the read with
retry = Some(reason). SlateDB may retry up to a bounded number of times before surfacing the failure. The retry tag tells the wrapper to drop the cached file rather than return the same corrupt bytes again.
Untagged calls flow through the wrapper unchanged and the wrapper applies its
default behavior. The bundled CachedObjectStore treats untagged writes
according to its existing cache_puts policy, preserving pre-protocol behavior
for known gaps such as object_store::buffered::BufWriter.
SlateDB core: supporting plumbing
Section titled “SlateDB core: supporting plumbing”Three ObjectStore methods cannot carry an intent because object_store
0.13 has no *_opts variant for them: delete, list, and the
convenience head(path). They still flow through whatever wrapper the
user installed, so a cache wrapper has everything it needs without a
separate hook:
delete(path)is enough signal to evict the local cache entry. Garbage collection already callsObjectStore::delete, so cache eviction on delete happens for free.list(prefix)is enough for enumeration.head(path)(convenience form) flows through and can be served from cached metadata. Paths that need intent on a head request can useget_optswithheadset to true instead.
Where the cache will live
Section titled “Where the cache will live”SlateDB talks to object storage through Arc<dyn ObjectStore>. The user passes
the innermost backend to Db::builder(path, backend). That backend is a raw
store (S3, GCS, Azure, InMemory, LocalFileSystem).
A cache (or any other wrapper) is supplied separately as an ObjectStoreWrapper
through .with_object_store_wrapper(..). SlateDB keeps its instrumentation and
retry on the backend, exactly as today, and applies the wrapper on top of them.
The wrapper can be:
- The bundled
CachedObjectStore, exposed as a wrapper. - Any other intent-aware wrapper the user wrote.
If the user supplies no wrapper, the protocol is a no-op for caching purposes: SlateDB still attaches intents on every call, but nothing reads them.
If the user supplies a wrapper, every read and write flows through it with a typed intent on the extensions.
This keeps the physical layering identical to what SlateDB ships today. The
only change is that the cache is constructed from a user-supplied wrapper rather
than from Settings. See Builder layering for the exact
order and The cache as a wrapper for how the bundled and
custom caches both plug in.
Builder layering
Section titled “Builder layering”DbBuilder, GarbageCollectorBuilder, CompactorBuilder, and
DbReaderBuilder will assemble the main object store the same way, from
innermost to outermost:
1. The backend the user passed (a raw store: S3, GCS, Azure, InMemory, LocalFileSystem) | v2. InstrumentedObjectStore (per-operation metrics) | v3. RetryingObjectStore (transient failure retries) | v4. The cache wrapper the user supplied, if any (the bundled CachedObjectStore or a custom intent-aware wrapper) | v SlateDB core (TableStore, ManifestStore, CompactionsStore, GC)Layers 2 and 3 are SlateDB-managed and sit directly on the backend, the same
order SlateDB ships today. Layer 4 is whatever cache wrapper the user supplied,
applied on top. SlateDB core will not construct a cache on the user’s behalf;
the bundled CachedObjectStore is one of several wrappers the user may supply.
Instrumentation and retry stay on the backend for two reasons. The metrics on layer 2 count only requests that reach the backend, so a cache hit served by layer 4 is not counted and the numbers continue to measure object store API traffic. Retry on layer 3 handles only backend errors with the policy designed for them, and the cache wrapper keeps full control of its own internal error handling.
Users who want the bundled cache supply it as a wrapper with
CachedObjectStore::builder:
let cache = CachedObjectStore::builder() .root_folder(dir) .build();let db = Db::builder(path, raw_backend) .with_object_store_wrapper(cache) .build() .await?;root_folder is the only required setting. Other settings
(cache_puts, part_size_bytes, max_cache_size_bytes, scan_interval,
max_open_file_handles) have defaults and are set via builder methods when
needed. SlateDB internals (metrics_recorder, system_clock, rand) are
defaulted as well and only overridden for tests or to share state with a
specific Db instance.
ObjectStoreCacheOptions will live in slatedb::cached_object_store, not in
Settings. SlateDB core will have no field that describes the cache; the
protocol on each call is the only contract it knows.
This is the same physical order SlateDB ships today: instrumentation and retry
wrap the backend, and the cache wraps them. What changes is only how the cache
is constructed. Today SlateDB reads Settings::object_store_cache_options and
builds CachedObjectStore itself. Under this proposal the user supplies the
cache as a wrapper and SlateDB applies it at step 4.
Because retry sits below the cache wrapper, the cache keeps full control of its own internal error handling. A transient disk I/O failure inside the cache, a corrupt local part, or a missing cache file are the cache’s to resolve (fall back to upstream, evict and refetch, or surface the error), and SlateDB does not apply a backend retry policy to them. The cache’s own calls to its inner store still get backend retries, because that inner store is the retrying store, so a custom wrapper never implements retry for backend calls itself.
Two consequences worth calling out:
- Compactor and GC builders will only add instrumentation and retry on the backend; they will not apply a cache wrapper. Users who want compactor reads or GC operations to go through their own cache supply the wrapper to those builders.
- Manifest writes and compactions metadata writes will flow through the cache
wrapper (because there is only one main store handle). They will still skip
the cache because they carry
WriteIntent::manifest()and the bundled wrapper short-circuits that intent. See Cache wrapper behavior.
The cache as a wrapper
Section titled “The cache as a wrapper”A cache plugs in as an ObjectStoreWrapper. SlateDB owns the seam: it builds the
instrumented, retrying backend first, then calls the wrapper to wrap it. This is
what keeps instrumentation and retry on the backend while leaving the cache
pluggable.
/// A wrapper SlateDB applies on top of the instrumented, retrying backend.////// SlateDB instruments and wraps the user's backend with retry first, then/// calls `wrap`, so the returned store sits above retry and below SlateDB core.#[async_trait]pub trait ObjectStoreWrapper: Send + Sync { async fn wrap( &self, inner: Arc<dyn ObjectStore>, ) -> Result<Arc<dyn ObjectStore>, SlateDBError>;}Each builder assembles the main store the same way. The instrument and retry unit is unchanged from today; the only new step is the optional wrapper on top:
// Unchanged: Instrument then Retry over the user's backend, with the// per-builder component and store_type labels.let base = instrumented_retrying_object_store( backend, &recorder, component, store_type, rand.clone(), system_clock.clone(),);
// The cache wrapper, if supplied, wraps the instrumented, retrying backend.let main_store = match &self.object_store_wrapper { Some(wrapper) => wrapper.wrap(base).await?, None => base,};The bundled CachedObjectStore. CachedObjectStore is now the wrapper the
user constructs: it holds the cache configuration and implements
ObjectStoreWrapper. The live cache ObjectStore that does the part-file I/O
becomes a private inner type (CachedObjectStoreInner below); wrap builds it
from the configuration and the inner store. The async setup that
CachedObjectStore::from_config does today moves onto that inner type:
impl CachedObjectStore { /// Start building the bundled cache wrapper. Holds config only, no I/O yet. pub fn builder() -> CachedObjectStoreBuilder { /* ... */ }}
#[async_trait]impl ObjectStoreWrapper for CachedObjectStore { async fn wrap( &self, inner: Arc<dyn ObjectStore>, ) -> Result<Arc<dyn ObjectStore>, SlateDBError> { // `inner` is the instrumented, retrying backend for this component. // `CachedObjectStoreInner` is the private live cache store. let cache = CachedObjectStoreInner::from_config(inner, &self.options, /* ... */).await?; Ok(cache) }}In every case the backend, instrumentation, and retry are assembled by SlateDB
in the same order it uses today. The cache is the one wrapper that moves out of
SlateDB and into a user-supplied ObjectStoreWrapper.
Cache wrapper behavior
Section titled “Cache wrapper behavior”The bundled CachedObjectStore initially keeps its current behavior:
- Reads always admit fetched bytes to the local cache on a miss.
- Writes follow the existing
cache_puts: boolmaster switch (off by default; when on, every PUT is written through). - Multipart puts pass through (never cached), as today.
Adopting the intent protocol does not change any of that on day one. What it does change is what the wrapper can do without further plumbing into SlateDB core. Each of these evolutions is now expressible as a small, isolated wrapper change, and each can land as its own follow-up:
| intent | Wrapper change |
|---|---|
WriteIntent::wal() on PUT with cache_puts | Skip the local write; forward to upstream only. |
WriteIntent::manifest() on PUT with cache_puts | Skip the local write; forward to upstream only. |
WriteIntent::compaction_output() on PUT | Add option to cache on compaction |
ReadIntent::compaction_input() on GET | Add option to skip cache lookup and skip miss-admit; forward to upstream. |
ReadIntent.retry = Some(_) on GET | Delete the cache entry for path, then run the usual cache-aware fetch. Closes problem 3; lets the fsync workaround be removed. |
Each future change is local to the wrapper. SlateDB core’s contract (tag every call with intent) does not move.
Concrete policy choices (admission per kind, part sizes, prefetch shape,
on-disk layout, eviction strategy) are owned by the wrapper. Users who want
different choices implement their own ObjectStore and pass it to
Db::builder.
Example
Section titled “Example”Simplest form with the bundled cache:
let backend: Arc<dyn ObjectStore> = Arc::new(InMemory::new());let cache = CachedObjectStore::builder() .root_folder(cache_dir) .build();let db = Db::builder("my-db", backend) .with_object_store_wrapper(cache) .build() .await?;With more options set:
let cache = CachedObjectStore::builder() .root_folder(cache_dir) .cache_puts(true) .part_size_bytes(8 * 1024 * 1024) .max_cache_size_bytes(32 * 1024 * 1024 * 1024) .build();For tests or advanced wiring (shared metrics recorder, mock clock, deterministic RNG):
let cache = CachedObjectStore::builder() .root_folder(cache_dir) .metrics_recorder(shared_recorder) .system_clock(mock_clock) .rand(seeded_rand) .build();Allocation overhead
Section titled “Allocation overhead”Tagging a single call pays three small heap allocations: the Box<HashMap> for
the lazily-initialized Extensions map, the HashMap’s bucket array on first
insert, and the Box<val> for the intent itself. None of the three can be
reused across calls because ObjectStore::get_opts and friends take
GetOptions by value, and Extensions::clone() is a deep clone of all three
pieces.
This does not affect read latency in any meaningful way next to a network or local-disk round trip. On hot paths (compaction reading thousands of blocks, warmup loops issuing many GETs) the aggregate allocator pressure is non-zero but well below the cost of the I/O itself.
A future optimization, if allocations come to matter, is to switch to
SlateDbObjectStore (zero allocation overhead, see
Alternatives) or to a static-intent-per-handle arrangement if
object_store ever changes get_opts to take &GetOptions.
Caveats
Section titled “Caveats”-
Upstream
BufWriterextensions bug. SST writes go throughobject_store::buffered::BufWriterwith extensions attached via.with_extensions(..).BufWriter::poll_shutdowndrops extensions on the single-PUT path (payload fits in capacity); the multipart overflow path forwards them. Tracked as apache/arrow-rs-object-store#735. Until that fix lands, SST writes below the default 10 MB BufWriter capacity will reach a wrapper untagged. Two matrix tests will document the bug (bufwriter_small_payload_drops_write_intent,bufwriter_large_payload_carries_write_intent) and act as regression catches for when the upstream fix lands.Practical effect: in production, small SST writes (memtable flushes that fit in 10 MB, occasional small compaction outputs) may reach the wrapper untagged. Since the bundled
CachedObjectStoreinitially keeps its current policy, those writes follow the pre-protocol behavior. WAL and Manifest writes do not go through BufWriter and so will always carry their intent correctly. -
Multipart-part-level intent on the wire.
WriteIntentattaches atput_multipart_optsinit, not on individualput_partcalls. A cache wrapper that wrapsMultipartUpload(which it must do anyway to observe part bytes) holds the intent in its own state and applies it to every part. Wrappers that do not wrapMultipartUploadsee only the init call. The bundled cache will not admit multipart uploads, so this gap does not affect it.
Impact Analysis
Section titled “Impact Analysis”Core API & Query Semantics
Section titled “Core API & Query Semantics”- Basic KV API (
get/put/delete) - Range queries, iterators, seek semantics
- Range deletions
- Error model, API errors
Consistency, Isolation, and Multi-Versioning
Section titled “Consistency, Isolation, and Multi-Versioning”- Transactions
- Snapshots
- Sequence numbers
Time, Retention, and Derived State
Section titled “Time, Retention, and Derived State”- Time to live (TTL)
- Compaction filters
- Merge operator
- Change Data Capture (CDC)
Metadata, Coordination, and Lifecycles
Section titled “Metadata, Coordination, and Lifecycles”- Manifest format: unchanged. Manifest writes will carry
WriteIntent::manifest()via theslatedb-txn-objconstructor. - Checkpoints
- Clones
- Garbage collection: GC deletes will flow through the wrapper, triggering cache eviction as a side effect. GC will use the same consolidated main object store handle.
- Database splitting and merging
- Multi-writer
Compaction
Section titled “Compaction”- Compaction state persistence: writes will carry
WriteIntent::manifest()via thecompactions_storeconstructor. - Compaction filters
- Compaction strategies
- Distributed compaction
- Compactions format: compaction-output SST writes will carry
WriteIntent::compaction_output(); compaction-input reads will carryReadIntent::compaction_input().
Storage Engine Internals
Section titled “Storage Engine Internals”- Write-ahead log (WAL): WAL writes will carry
WriteIntent::wal()so wrappers can skip admitting them. - Block cache: unchanged; the in-process block cache continues to operate independently.
- Object store cache: today’s
CachedObjectStorewill be decoupled from core construction and made user-constructible. Its pre-existingcache_puts: boolmaster switch and admission behavior are preserved initially. - Indexing (bloom filters, metadata)
- SST format or block format: the decode path will issue retry-tagged reads on recoverable validation failures.
Ecosystem & Operations
Section titled “Ecosystem & Operations”- CLI tools
- Language bindings (Go/Python/etc)
- Observability (metrics/logging/tracing): no new SlateDB metrics. The bundled cache wrapper will continue to emit its existing stats; wrappers own their own observability.
Operations
Section titled “Operations”Performance and Cost
Section titled “Performance and Cost”Initially, the measurable runtime change is per-call tagging overhead:
- CPU and memory pressure. Tagging adds a small per-call heap allocation overhead. Latency against network or local-disk I/O is unaffected, and aggregate allocator traffic on compaction or warmup stays well below the I/O cost. See Allocation overhead.
Latency, throughput, and object-store cost do not otherwise change initially because the bundled wrapper keeps its current behavior (see Cache wrapper behavior). They can evolve as the follow-ups in that section land:
- Skipping WAL / manifest / compaction-output write-through reduces local
disk writes when
cache_puts: trueand leaves more room for hot data. - Bypassing compaction-input reads keeps one-shot bulk reads out of the cache, trading cache-hit savings on compactor reads for a smaller working set on foreground reads.
- Evicting on retry-tagged reads lets the fsync workaround go away, removing one local disk sync per admitted write at the cost of one extra upstream GET on the rare CRC-mismatch path.
Observability
Section titled “Observability”- SlateDB will add no new metrics for the protocol itself.
- The bundled
CachedObjectStoreretains its existing hit-rate, eviction, and admission stats. - User-supplied wrappers own their own metrics.
Compatibility
Section titled “Compatibility”- API. Users who set
Settings::object_store_cache_options.root_folderto enable the bundled cache will need to construct aCachedObjectStore::builder()and supply it throughDb::builder(..).with_object_store_wrapper(..)(the auto-construct convenience is removed when the field moves out ofSettings). The bundled cache’s initial admission behavior, includingcache_puts, is preserved. - On-disk format. Unchanged. The disk format used by
CachedObjectStorestays the same, so cached files survive the migration. - Wire format. Unchanged.
- Wrappers and backends that ignore the protocol. Unaffected.
Rollout
Section titled “Rollout”The work spans two crates: slatedb-txn-obj and slatedb. slatedb depends
on slatedb-txn-obj, so the Extensions support on the slatedb-txn-obj
protocol and boundary objects lands first; the SlateDB-side changes that use
it follow.
Alternatives
Section titled “Alternatives”Keep the cache in SlateDB. Works today. Locks policy into SlateDB, makes it hard to ship alternative caches, and forces every policy change through a SlateDB release. Rejected.
A separate SlateDbObjectStore trait that wraps an inner ObjectStore.
Define SlateDB’s own storage interface as a distinct trait whose methods take
intents as regular arguments, but keep object_store as the backend layer the
user constructs:
pub trait SlateDbObjectStore { async fn get(&self, path: &Path, range: Range<usize>, intent: ReadIntent) -> Result<Bytes>; async fn put(&self, path: &Path, payload: PutPayload, intent: WriteIntent) -> Result<()>; async fn delete(&self, path: &Path) -> Result<()>; // ...}Zero allocation overhead and SlateDB-owned. SlateDB still depends on
object_store and the user still constructs an ObjectStore backend to hand
in (wrapped by a default SlateDbObjectStore adapter, or by a custom one).
Trade-off: a new public trait that SlateDB must version, and generic
intermediate wrappers (rate limiter, encryption) must either sit below the
cache or be rewritten as SlateDbObjectStore-aware versions.
Not the proposed approach. The Extensions allocation cost is dwarfed by the
I/O on every realistic call. If that changes, the migration is mechanical:
replace extensions.insert(intent) with a typed argument and
extensions.get::<T>() with a method parameter.
Intent-aware cache policy as a thin router above CachedObjectStore. Keep
CachedObjectStore as the physical cache, add a layer above that decides
per-intent whether to use the cache, look up only, bypass, evict and refetch, or
write-through; the policy trait would be the public surface.
Not the proposed approach. Reasons: the bundled CachedObjectStore can do the
routing internally based on intent, the proposed policy trait would still
require typing the routing decisions, and allowing users to bring a fully custom
cache implementation (different on-disk format, io_uring, sharding,
O_DIRECT) is more flexible than letting them substitute only the policy.
A SlateDB-owned trait that does not wrap ObjectStore at all. A bigger
version of the previous alternative. SlateDB defines its own storage trait
(SlateDbObjectStore) and ships built-in adapters (e.g. for object_store
and opendal). The object_store crate is no longer a public dependency:
users pick from SlateDB’s adapters or implement the trait themselves against
a raw SDK. Zero allocation overhead, full control over the storage contract
(batch ops, richer cache hints, per-call deadlines, etc.), and freedom to
evolve without coordinating with upstream.
Cost: SlateDB owns the adapter surface and any future backend shape changes, and users who want a backend not yet adapted have to write the adapter themselves. The biggest structural change of the three trait options, and the biggest ownership commitment. Not the proposed approach, kept on the table as a future direction if ecosystem coupling or extensibility needs grow.
Defer to OpenDAL’s cache layer. OpenDAL recently merged a cache layer RFC
(apache/opendal#6297) that
supports chunk caching. With the protocol in this RFC, nothing prevents a user
from wiring an OpenDAL-backed cache as their ObjectStore. If OpenDAL’s cache
layer reads typed cache hints from Extensions or grows an intent-equivalent
argument, this RFC’s protocol bridges to it directly. SlateDB does not need to
choose; users do.
Static intent per ObjectStore handle (requires upstream API change).
Construct multiple internal handles each preconfigured with a fixed intent,
sharing pre-built GetOptions / PutOptions across many calls. Only valid if
ObjectStore::get_opts takes &GetOptions by reference instead of by value.
The upstream change is on the object_store crate; if it ever lands, this
becomes the cleanest allocation-free path that keeps the Extensions protocol.
Open Questions
Section titled “Open Questions”- Should the object store cache live in its own crate (e.g.
slatedb-obj-cache)? Not proposed here. With the protocol in place and the cache no longer required for correctness, but it seems like the right step to do after the RFC is executed. - Allocation cost in practice. Not measured. Worth profiling on compaction
and warmup paths before committing to further optimization (or before
switching to
SlateDbObjectStore).
References
Section titled “References”object_store::GetOptions(carriesextensions: ::http::Extensions)object_store::PutOptions(carriesextensions: ::http::Extensions)- RFC 0023: Targeted Cache Warming and Best-Effort Block Cache Eviction
- Relevant issue: slatedb/slatedb#703
- Existing in-tree
CachedObjectStore(slatedb/src/cached_object_store/) - Upstream
BufWriterextensions bug: apache/arrow-rs-object-store#735 - OpenDAL cache layer: apache/opendal#6297
- OpenDAL
FoyerService: apache/opendal#7160 - Draft PR for a prototype for the implementation