Garbage Collector Boundary Files for Sequenced Metadata
Status: Draft
Authors:
Summary
SlateDB stores .manifest and .compactions state in sequenced
object-store files. The filename is the commit point: a writer creates the next
file with a create-if-absent operation, and success means the sequenced update
won.
This protocol is unsafe when a writer stalls longer than the garbage
collector’s min_age setting. A stalled writer can prepare file N+1, another
writer can create and supersede N+1, GC can later delete N+1, and the
stalled writer can then resume and successfully create the same filename. The
stale writer observes success even though the original create-if-absent fencing
point should have rejected it.
This RFC adds durable boundary files for the .manifest and .compactions
namespaces. Before GC deletes old sequenced metadata files, it advances the
namespace boundary. After a writer creates a sequenced metadata file, it checks
the boundary before returning success. If the created ID is at or behind the
boundary, the write is treated as failed.
Background
SlateDB has two sequenced metadata namespaces:
- .manifest files, named like 00000000000000000012.manifest
- .compactions files, named like 00000000000000000013.compactions
Both namespaces use the same object-store sequencing pattern:
- Read the latest object ID.
- Compute the next object ID.
- Write the next object with create-if-absent.
- Treat create success as the committed update.
GC currently deletes old metadata files using the existing retention rules:
- .manifest: delete files older than min_age when they are not the latest manifest and are not referenced by an active checkpoint.
- .compactions: delete files older than min_age when they are not the latest compactions file.
These rules assume that a writer will not pause between computing the next
object ID and creating that object for longer than min_age.
Motivation
The unsafe sequence is the same for both metadata namespaces:
- Writer A reads metadata file N and prepares an update to file N+1.
- Writer A stalls before creating N+1.
- Writer B creates N+1; later writers advance the namespace to N+2, N+3, and so on.
- GC deletes N+1 after it becomes old enough and is no longer retained by the namespace’s normal retention rules.
- Writer A resumes and creates N+1 successfully because the object no longer exists.
- Writer A treats the stale update as committed.
The root problem is that object-store create-if-absent only protects against objects that currently exist. Once GC deletes a sequenced metadata object, create-if-absent no longer remembers that the ID was already used.
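The sequence above can be replayed against a toy in-memory store. This is a minimal sketch, not SlateDB code: ToyStore, its methods, and replay_stale_write are hypothetical names introduced only to show that create-if-absent has no memory of deleted IDs.

```rust
use std::collections::BTreeMap;

/// Toy in-memory object store keyed by sequence ID.
/// Hypothetical stand-in for illustration; not SlateDB's object-store API.
#[derive(Default)]
struct ToyStore {
    objects: BTreeMap<u64, String>,
}

impl ToyStore {
    /// Create-if-absent: fails only if the object exists right now.
    fn create_if_absent(&mut self, id: u64, body: &str) -> Result<(), String> {
        if self.objects.contains_key(&id) {
            return Err(format!("object {} already exists", id));
        }
        self.objects.insert(id, body.to_string());
        Ok(())
    }

    /// GC-style delete of an old object.
    fn delete(&mut self, id: u64) {
        self.objects.remove(&id);
    }
}

/// Replays the unsafe sequence and returns the result of the stale create.
fn replay_stale_write() -> Result<(), String> {
    let mut store = ToyStore::default();
    store.create_if_absent(1, "initial")?; // file N

    // Writer A reads N = 1, prepares N+1 = 2, then stalls.
    let stale_id = 2;

    // Writer B creates N+1; a later writer advances to N+2.
    store.create_if_absent(2, "writer B")?;
    store.create_if_absent(3, "later writer")?;

    // GC deletes N+1 once it is old enough and no longer the latest file.
    store.delete(2);

    // Writer A resumes. The create succeeds even though ID 2 was already used.
    store.create_if_absent(stale_id, "writer A (stale)")
}
```

In this sketch the final create returns Ok, which is exactly the outcome the RFC sets out to prevent.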
Goals
- Prevent stale .manifest and .compactions writes from returning success after GC has made their target IDs unsafe to reuse.
- Preserve the existing sequenced metadata write protocol and normal GC retention rules.
- Make boundary checks efficient enough for normal metadata writes.
- Make the boundary implementation flexible enough to be used for WAL boundaries, should we want to add those in the future.
Non-Goals
- Redesign metadata storage or replace sequenced object filenames as the commit point.
- Move metadata deletion into the writer or compactor processes.
- Cover storage namespaces other than .manifest and .compactions.
Design
Boundary Files
Add one boundary file per sequenced metadata namespace:
- /gc/manifest.boundary
- /gc/compactions.boundary
Each file contains a single ASCII-encoded u64:
12
The value is an inclusive high-watermark. A boundary value B means that
metadata IDs <= B in that namespace must be treated as potentially deleted.
Writers must not treat a newly created sequenced metadata file with ID i <= B
as successful.
The protocol has two invariants:
- Before GC may delete sequenced metadata file ID i, the durable boundary for that namespace must be >= i.
- Before a writer may return success for sequenced metadata file ID i, it must observe the durable boundary for that namespace as < i after the file create succeeds.
If a boundary file has never existed, readers use boundary value 0. If a
process has observed a boundary file and later finds it missing, it panics
because GC must never delete boundary files.
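As a rough illustration of the file format and the comparison writers perform, the sketch below encodes and decodes a boundary value. The function names are hypothetical, and trimming surrounding whitespace before parsing is an assumption of this sketch, not something the RFC specifies.

```rust
/// Decodes boundary file contents. `None` models a boundary file that has
/// never existed, which reads as boundary 0.
/// Assumption: surrounding whitespace is trimmed before parsing.
fn decode_boundary(bytes: Option<&[u8]>) -> Result<u64, String> {
    let raw = match bytes {
        None => return Ok(0),
        Some(raw) => raw,
    };
    let text = std::str::from_utf8(raw).map_err(|e| e.to_string())?;
    text.trim().parse::<u64>().map_err(|e| e.to_string())
}

/// Encodes a boundary value as ASCII decimal digits.
fn encode_boundary(value: u64) -> Vec<u8> {
    value.to_string().into_bytes()
}

/// The boundary is inclusive: only IDs strictly greater than it may commit.
fn is_safe_to_commit(id: u64, boundary: u64) -> bool {
    id > boundary
}
```

With a boundary of 12, ID 12 is rejected and ID 13 is accepted. The panic on a boundary file that disappears after having been observed is not modeled here.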
Advancing Boundaries
GC advances the boundary before deleting old metadata files from a namespace.
For a namespace, GC computes the desired boundary as:
- List the namespace’s metadata files.
- Remove the most recent metadata file from the list (it must always be kept).
- Keep files in the list whose object-store last_modified timestamp is older than min_age.
- Choose the maximum file ID from that filtered list.
If no files are old enough, GC skips the boundary update for that namespace.
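The desired-boundary computation can be sketched as a pure function over a listing. MetadataFile and desired_boundary are hypothetical names for this sketch, and "most recent" is taken to mean the file with the highest ID.

```rust
use std::time::{Duration, SystemTime};

/// One entry from a namespace listing (hypothetical type for this sketch).
struct MetadataFile {
    id: u64,
    last_modified: SystemTime,
}

/// Returns the boundary GC wants to reach, or `None` when no file is old
/// enough and the boundary update should be skipped.
fn desired_boundary(
    files: &[MetadataFile],
    now: SystemTime,
    min_age: Duration,
) -> Option<u64> {
    // The most recent file (highest ID here) must always be kept.
    let latest = files.iter().map(|f| f.id).max()?;
    files
        .iter()
        .filter(|f| f.id != latest)
        // Keep only files strictly older than min_age. A timestamp in the
        // future makes duration_since fail and is treated as "not old enough".
        .filter(|f| {
            now.duration_since(f.last_modified)
                .map(|age| age > min_age)
                .unwrap_or(false)
        })
        .map(|f| f.id)
        .max()
}
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the sketch deterministic and easy to test.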
Boundary updates are monotonic:
1. Read the current boundary value and object version metadata.
2. If the desired boundary is less than or equal to the current value, the boundary is already advanced.
3. If the boundary file is missing, create it with create-if-absent.
4. Otherwise, update it with a conditional object-store update using the version metadata from step 1.
5. If a concurrent GC wins the conditional update race, retry until the durable boundary is greater than or equal to the desired value.
The boundary file must never move backward.
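The monotonic update amounts to a compare-and-swap retry loop. The sketch below models the boundary object as a versioned in-memory cell. ToyBoundaryObject, Conflict, and advance are hypothetical names, and the integer version stands in for whatever version metadata (for example an ETag) the object store provides.

```rust
/// Returned when a conditional write loses a race.
#[derive(Debug, PartialEq)]
struct Conflict;

/// Toy stand-in for one boundary object (not SlateDB's API).
/// `state` holds (value, version); `None` means the file was never created.
#[derive(Default)]
struct ToyBoundaryObject {
    state: Option<(u64, u64)>,
}

impl ToyBoundaryObject {
    fn read(&self) -> Option<(u64, u64)> {
        self.state
    }

    fn create_if_absent(&mut self, value: u64) -> Result<(), Conflict> {
        if self.state.is_some() {
            return Err(Conflict);
        }
        self.state = Some((value, 1));
        Ok(())
    }

    fn update_if_version(&mut self, value: u64, expected: u64) -> Result<(), Conflict> {
        if let Some((_, version)) = self.state {
            if version == expected {
                self.state = Some((value, version + 1));
                return Ok(());
            }
        }
        Err(Conflict)
    }
}

/// Advances the boundary to at least `desired` and returns the value that
/// was observed or written. The boundary never moves backward.
fn advance(obj: &mut ToyBoundaryObject, desired: u64) -> u64 {
    loop {
        let observed = obj.read();
        // A boundary file that has never existed reads as 0.
        let current = observed.map(|(value, _)| value).unwrap_or(0);
        if desired <= current {
            return current; // already advanced far enough
        }
        let written = match observed {
            None => obj.create_if_absent(desired),
            Some((_, version)) => obj.update_if_version(desired, version),
        };
        if written.is_ok() {
            return desired;
        }
        // Lost the race to a concurrent GC: re-read and try again.
    }
}
```

Because the cell is single-threaded, the retry branch never fires here. It is kept to show where a real implementation would loop after a failed conditional update.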
Checking Boundaries After Writes
Every sequenced metadata write must check the corresponding boundary after the create-if-absent operation succeeds:
- Create the next metadata file with create-if-absent.
- If create-if-absent fails because the object already exists, return the existing sequenced write conflict error.
- Read the namespace boundary.
- If the just-created ID is less than or equal to the boundary, return a boundary error and do not report the write as committed.
- Otherwise, return success.
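The write path above can be sketched with a toy namespace that holds both the set of existing file IDs and the boundary. ToyNamespace, WriteError, gc_delete, and write_checked are hypothetical names; the RFC's actual error type is ObjectVersionBehindBoundary.

```rust
use std::collections::BTreeSet;

/// Errors a checked write can return in this sketch.
#[derive(Debug, PartialEq)]
enum WriteError {
    AlreadyExists { id: u64 },
    BehindBoundary { id: u64, boundary: u64 },
}

/// Toy namespace: existing file IDs plus the durable boundary.
/// Hypothetical stand-in for illustration; not SlateDB's API.
struct ToyNamespace {
    files: BTreeSet<u64>,
    boundary: u64,
}

/// GC side: advance the boundary first, then delete the file.
fn gc_delete(ns: &mut ToyNamespace, id: u64) {
    ns.boundary = ns.boundary.max(id);
    ns.files.remove(&id);
}

/// Writer side: create-if-absent, then check the boundary before
/// reporting success.
fn write_checked(ns: &mut ToyNamespace, id: u64) -> Result<(), WriteError> {
    // BTreeSet::insert returns false when the ID is already present,
    // which models a create-if-absent conflict.
    if !ns.files.insert(id) {
        return Err(WriteError::AlreadyExists { id });
    }
    // The boundary must be read after the create has succeeded.
    let boundary = ns.boundary;
    if id <= boundary {
        // This sketch leaves the stale object in place; cleanup is out of scope.
        return Err(WriteError::BehindBoundary { id, boundary });
    }
    Ok(())
}
```

Replaying the stale-writer scenario against this model, a write to an ID that GC has already deleted fails with BehindBoundary instead of succeeding, while a write to the next unused ID still succeeds.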
Boundary reads can be optimized with an in-memory cache and conditional GETs
using If-None-Match. If the object store returns “not modified”, the writer
can reuse the cached boundary value. If it returns a new value, the writer
updates the cache and checks the created ID against that value.
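One way to picture the cached read is a small cache that remembers the last value and ETag and revalidates on every check. ToyRemote, GetResult, and BoundaryCache are hypothetical names, and the remote is an in-memory stand-in that only mimics If-None-Match semantics. A boundary file that does not exist yet is not modeled.

```rust
/// Result of a conditional GET against the toy remote.
enum GetResult {
    NotModified,
    Modified { value: u64, etag: String },
}

/// In-memory stand-in for the boundary object in the object store.
struct ToyRemote {
    value: u64,
    etag: String,
}

impl ToyRemote {
    /// Mimics `If-None-Match`: reports NotModified when the ETag matches.
    fn get_if_none_match(&self, etag: Option<&str>) -> GetResult {
        if etag == Some(self.etag.as_str()) {
            GetResult::NotModified
        } else {
            GetResult::Modified {
                value: self.value,
                etag: self.etag.clone(),
            }
        }
    }
}

/// Writer-side cache of the last observed (value, etag) pair.
#[derive(Default)]
struct BoundaryCache {
    cached: Option<(u64, String)>,
}

impl BoundaryCache {
    /// Always contacts the remote; only the response body is saved when
    /// the remote reports that nothing changed.
    fn current(&mut self, remote: &ToyRemote) -> u64 {
        let etag = self.cached.as_ref().map(|(_, e)| e.as_str());
        match remote.get_if_none_match(etag) {
            GetResult::NotModified => self
                .cached
                .as_ref()
                .map(|(value, _)| *value)
                .expect("NotModified implies a cached entry"),
            GetResult::Modified { value, etag } => {
                self.cached = Some((value, etag));
                value
            }
        }
    }
}
```

The cache never answers without revalidating, so it saves transfer of the value rather than the round trip itself.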
The boundary read must not be served from a stale object cache. Manifest and
compactions stores must not use CachedObjectStore.
Garbage Collection
After advancing a namespace boundary, GC continues to apply the normal deletion rules:
- .manifest: delete files at or behind the boundary when they are older than min_age, are not the latest manifest, and are not referenced by an active checkpoint.
- .compactions: delete files at or behind the boundary when they are older than min_age and are not the latest compactions file.
Implementation
- Add BoundaryObject to the transactional object crate with:
  - check(id): verify that id is greater than the durable boundary.
  - advance(boundary): durably advance the boundary to at least boundary.
- Add BoundedSequencedStorage<T>, a SequencedStorageProtocol<T> wrapper that delegates the write and then calls BoundaryObject::check before returning success.
- Add ObjectStoreBoundaryObject, stored under <root>/gc/<name>.boundary, using ASCII u64 encoding.
- Add ObjectVersionBehindBoundary { id, boundary } error type to represent a write that created an ID at or behind the durable boundary.
- Wrap ManifestStore with manifest.boundary.
- Wrap CompactionsStore with compactions.boundary.
- Add advance_boundary methods to the manifest and compactions stores for GC.
- Update manifest and compactions GC tasks to compute the maximum old-enough ID and advance the boundary before deleting files.
- Remove CachedObjectStore usage for ManifestStore and CompactionsStore to ensure boundary checks are not served from a stale cache. Enforce this by asserting PutMode::Create and all GETs are unconditional.
Impact Analysis
SlateDB features and components that this RFC interacts with:
Core API & Query Semantics
- Basic KV API (get/put/delete)
- Range queries, iterators, seek semantics
- Range deletions
- Error model, API errors
Consistency, Isolation, and Multi-Versioning
- Transactions
- Snapshots
- Sequence numbers
Time, Retention, and Derived State
- Time to live (TTL)
- Compaction filters
- Merge operator
- Change Data Capture (CDC)
Metadata, Coordination, and Lifecycles
- Manifest format
- Checkpoints
- Clones
- Garbage collection
- Database splitting and merging
- Multi-writer
Compaction
- Compaction state persistence
- Compaction filters
- Compaction strategies
- Distributed compaction
- Compactions format
Storage Engine Internals
- Block cache
- Object store cache
- Indexing (bloom filters, metadata)
- SST format or block format
Ecosystem & Operations
- CLI tools
- Language bindings (Go/Python/etc)
- Observability (metrics/logging/tracing)
Operations
Performance & Cost
- Latency (reads/writes/compactions): metadata writes add one boundary check after each write.
- Throughput (reads/writes/compactions): metadata write throughput may drop with the extra object-store round trip.
- Object-store request (GET/LIST/PUT) and cost profile: checks can use cached ETags but must still GET; GC may add one conditional boundary update per namespace.
- Space, read, and write amplification: adds two small boundary files and no data-file amplification.
Observability
- Configuration changes: none.
- New components/services: none.
- Metrics: track boundary check latency, advance attempts, and rejected stale writes.
- Logging: warn on boundary advance failures and stale write rejections.
Compatibility
- Existing data on object storage / on-disk formats: missing boundary files read as boundary 0; new files are additive.
- Existing public APIs (including bindings): no API changes.
- Rolling upgrades / mixed-version behavior (if applicable): mixed versions retain the existing stale-writer risk until all writers check boundaries.
Testing
- Unit tests: standard unit tests for new code paths.
- Integration tests: None.
- Fault-injection/chaos tests: None.
- Deterministic simulation tests: DST covers this pattern. Draft PR triggered expected failures.
- Formal methods verification: SequencedMetadataBoundary.fizz is included in the formal verification suite.
- Performance tests: None.
Rollout
- Milestones / phases: None.
- Feature flags / opt-in: None.
- Docs updates: files.mdx and gc.mdx will be updated to explain boundary files and their role in GC safety.
Alternatives
Status quo
- Keep relying on min_age as a writer-stall bound.
- Rejected because pathological stalls can still allow stale sequenced metadata writes to commit.
Increase min_age
- Reduces the probability of the bug.
- Rejected because it does not eliminate the failure mode and increases metadata retention.
References
- slatedb/slatedb#1646: Add BoundaryObject GC watermarks to prevent data loss in slatedb-txn-obj
- slatedb/slatedb#1622: pathological data-loss configuration
- OSWALD: a WAL implementation with a similar boundary concept. OSWALD’s “snapshot” is our .manifest (or .compactions), and its “manifest” is our boundary file. This is a rough analogy, not a direct mapping, but the manifest serves a similar purpose.