Compaction State Persistence
Status: Accepted
Authors:
Background
Compaction currently happens for the following:
- L0 SSTs
- Sorted Runs at various levels (range-partitioned SSTs across the complete keyspace)
This RFC proposes the goals and design for compaction state persistence, along with ways to improve the current compaction mechanism by adding retries and tracking. The goals are:
- Provide a mechanism to track the progress of a `CompactionJob`
- Allow retrying compactions based on the state of the `CompactionJob`
- Improve observability around compactions
- Separate compaction-related details from the `Manifest` into a separate `CompactionState`
- Coordinate between the `Manifest` and the `CompactionState`
- Provide a coordination mechanism between externally triggered compactions and the main compaction process
- Refactor the manifest store so that it can be used to store both .manifest and .compactor files
Non-Goals
- Distributed compaction: SlateDB is a single-writer and currently a single-compactor database. With distributed compaction, we plan to further parallelise SST compaction across different compaction processes. This topic is out of scope for this RFC.
- Resuming partial compaction under MVCC depends on the sorted-run layout: with multi-versioned keys, how do we partition the keyspace into non-overlapping SSTs within a single SR?
Constraints
- Changes should be backward compatible and extend the existing compaction structs
- State updates should be cost efficient
- Manifest can be eventually consistent with the latest view after compaction
References
Problem Statement
This RFC extends discussions in the GitHub issue below. It also addresses several other sub-issues.
Core Architecture Issues
- 1:1 Compaction:Job cardinality: failed compactions cannot be retried; the entire compaction fails if its single job fails
- No progress tracking: `CompactionJob` state isn't persisted, making progress invisible
- No state persistence: all compaction state is lost on restart
Operational Limitations
- Manual Compaction Gaps: No coordination mechanism for operator-triggered compactions (Issue #288)
- GC Coordination Issues: Garbage collector needs better visibility into ongoing compactions (Issue #604)
- Limited Observability: Little visibility into compaction progress and failures
Impact
- Large compactions (multi-GB) lose hours of work on failure
- Engineering overhead for debugging and manually restarting failed compactions
- Customer impact from extended recovery times during outages
- Resource waste from repeated processing of the same data
Proposal
Core Strategy: Iterator-Based Persistence
Rather than complex chunking mechanisms, we leverage SlateDB's existing iterator architecture which provides natural persistence boundaries at SST completion points. This approach:
- Builds on existing infrastructure: Enhances the current `execute_compaction` method
- Uses natural boundaries: SST completions provide ~256MB recovery granularity
- Minimizes overhead: Persistence aligns with existing I/O patterns
- Scales cost-effectively: Higher persistence frequency for larger, more valuable compactions
Workflow
Current Compaction Workflow

1. `Compactor` initialises the `CompactionScheduler` and `CompactionEventHandler` during startup. It also initialises the event loop that periodically polls the manifest, periodically logs and reports progress, and handles completed compactions. [No change required]
2. The `CompactionEventHandler` refreshes the compaction state by merging it with the current manifest.
3. The `CompactionEventHandler` communicates this compaction state to the `CompactionScheduler` (the scheduler makes a call to `maybeScheduleCompaction` with the local database state).
4. `CompactionScheduler` is implemented by `SizeTieredCompactionScheduler` to decide and group L0 SSTs and SRs to be compacted together. It returns a list of `Compaction`s that are ready for execution.
5. The `CompactorEventHandler` iterates over the list of compactions and calls `submitCompaction()` if the count of running compactions is below the threshold.
6. The submitted compaction is validated to ensure it is not already being executed (by checking the local `CompactorState`) and, if so, is added to the `CompactorState` struct.
7. Once the `CompactorEventHandler` receives an affirmation, it calls `startCompaction()` to start the compaction.
8. The compaction is now transformed into a `CompactionJob`, and a blocking task is spawned to execute the `CompactionJob` via the `CompactionExecutor`.
9. The task loads all the iterators into a `MergeIterator` struct and runs the compaction on it. It discards older expired versions and continues writing to an SST. Once the SST reaches its threshold size, the SST is written to the active destination SR. The task also periodically reports stats on its progress. (A sketch of this merge loop follows the list.)
10. When a task completes compaction execution, it returns `{destinationId, outputSSTs}` to the worker channel to act upon the compaction's terminal state.
11. The worker task executes `finishCompaction()` upon successful compaction completion, updates the manifest, and triggers scheduling of the next compactions by calling `maybeScheduleCompaction()`.
12. In case of failure, the compaction_state is updated by calling `finishFailedCompaction()`.
13. GC clears the orphaned states and SSTs during its run.
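For illustration, the merge in step 9 can be modelled as a k-way merge over already-sorted inputs. The snippet below is a simplified, self-contained sketch (in-memory `(key, seq, value)` lists rather than real SST iterators) of how entries are ordered and how older versions of a key are discarded; it is not the actual `MergeIterator` implementation.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simplified model of the k-way merge in step 9: pop entries in key order
/// across all inputs, keeping only the newest version (highest seq) of each key.
fn merge_sorted_inputs(inputs: Vec<Vec<(Vec<u8>, u64, Vec<u8>)>>) -> Vec<(Vec<u8>, u64, Vec<u8>)> {
    let mut heap = BinaryHeap::new();
    let mut cursors = vec![0usize; inputs.len()];
    for (i, input) in inputs.iter().enumerate() {
        if let Some((key, seq, _)) = input.first() {
            // Min-heap on (key, descending seq): the newest version of a key pops first.
            heap.push(Reverse((key.clone(), Reverse(*seq), i)));
        }
    }
    let mut out: Vec<(Vec<u8>, u64, Vec<u8>)> = Vec::new();
    while let Some(Reverse((key, _seq, i))) = heap.pop() {
        let entry = inputs[i][cursors[i]].clone();
        // Older versions of a key that has already been emitted are discarded.
        if out.last().map(|(k, _, _)| k != &key).unwrap_or(true) {
            out.push(entry);
        }
        cursors[i] += 1;
        if let Some((k, s, _)) = inputs[i].get(cursors[i]) {
            heap.push(Reverse((k.clone(), Reverse(*s), i)));
        }
    }
    out
}
```

In the real executor the merged stream is cut into ~256MB output SSTs that are appended to the destination SR, which is exactly the boundary this RFC proposes to persist at.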
Proposed CompactionState Structure
The persistent state contains the complete view of all compaction activity:
```rust
pub(crate) struct CompactionJob {
    pub(crate) id: Ulid,
    pub(crate) destination: u32,
    pub(crate) ssts: Vec<SsTableHandle>,
    pub(crate) sorted_runs: Vec<SortedRun>,
    pub(crate) compaction_ts: i64,
    pub(crate) is_dest_last_run: bool,
    pub(crate) completed_input_sst_ids: Vec<Ulid>,
    pub(crate) completed_input_sr_ids: Vec<u32>,
    pub(crate) output_sr: SortedRun,
}

pub(crate) enum CompactionType {
    Internal,
    External,
}

pub struct Compaction {
    pub(crate) status: CompactionStatus,
    pub(crate) sources: Vec<SourceId>,
    pub(crate) destination: u32,
    pub(crate) compaction_id: Ulid,
    pub(crate) compaction_type: CompactionType,
    pub(crate) job_attempts: Vec<CompactionJob>,
}

pub(crate) struct CompactorState {
    manifest: DirtyManifest,
    compaction_state: DirtyCompactionState,
}

pub(crate) struct CompactionState {
    compactor_epoch: u64,
    // active compactions: queued, in-progress and completed
    compactions: HashMap<Ulid, Compaction>,
}

pub(crate) struct DirtyCompactionState {
    id: u64,
    compactor_epoch: u64,
    compaction_state: CompactionState,
}

pub(crate) struct StoredCompactionState {
    id: u64,
    compaction_state: CompactionState,
    compaction_state_store: Arc<CompactionStateStore>,
}

pub(crate) struct FenceableCompactionState {
    compaction_state: StoredCompactionState,
    local_epoch: u64,
    stored_epoch: fn(&CompactionState) -> u64,
}
```
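The 1:N relationship between `Compaction` and `CompactionJob` is what enables retries. As an illustration (not the proposed implementation), a retry can seed the next attempt from the previous one so already-completed inputs are skipped; this sketch assumes `Clone` on `CompactionJob`, `Default` on `SortedRun`, and a caller-supplied fresh `Ulid`:

```rust
// Sketch only: create the next job attempt for a compaction, carrying over the
// progress recorded by the previous attempt (if any).
fn next_job_attempt(compaction: &mut Compaction, new_id: Ulid) -> CompactionJob {
    let mut job = match compaction.job_attempts.last() {
        // Retry: reuse completed inputs and the partially written output SR.
        Some(prev) => prev.clone(),
        // First attempt: start with empty progress; the remaining fields are
        // filled in by the scheduler/executor and are elided in this sketch.
        None => CompactionJob {
            id: new_id,
            destination: compaction.destination,
            ssts: Vec::new(),
            sorted_runs: Vec::new(),
            compaction_ts: 0,
            is_dest_last_run: false,
            completed_input_sst_ids: Vec::new(),
            completed_input_sr_ids: Vec::new(),
            output_sr: SortedRun::default(),
        },
    };
    job.id = new_id; // every attempt gets its own id
    compaction.job_attempts.push(job.clone());
    job
}
```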
Persisting Internal Compactions
1. The compactor fetches compactions with compactionStatus `submitted` from the compaction_state polled during this compactionEventLoop iteration and returns a list of compactions.
2. `SizeTieredCompactionScheduler` executes `maybe_schedule_compaction` and appends to this list of compactions.
3. The compactor executes the `submit_compaction` method on the list of compactions from step (1). The method delegates the validation of the compactions to compactor_state.rs.
4. For each compaction in the input list, compactor_state.rs executes its own `submit_compaction` method, which performs the following validations against the compaction_state:
   - Check that the count of running compactions is less than the threshold. If yes, continue.
   - Check that the source L0 SSTs and SRs are not part of any other compaction. If yes, continue.
   - Check that the destination SR is not part of any other compaction. If yes, continue.
   - Add compaction validations to verify that the correct group of sources and destinations is selected. (Reference here.)
   - The existing validations in the `submit_compaction` method.
5. When a compaction validates successfully, its status is updated to `in_progress` in the compaction_state. When validation fails, its status is updated to `failed` in the CompactionState.
6. Try writing the compactor_state to the next sequential .compactor file. If the file exists:
   - If the latest .compactor compactor_epoch > current compactor epoch, die (fenced).
   - If the latest .compactor compactor_epoch == current compactor epoch, reconcile the compactor_state and retry. (This would happen only when an external process like the CLI has written a manual compaction request.)
   - If the latest .compactor compactor_epoch < current compactor epoch, panic (`compactor_epoch` went backwards).
   (When this step succeeds, the compaction is persisted in the .compactor file.)
7. Now call `start_compaction()` for each compaction in the `compactions` param if the count of running compactions is below the threshold.
8. A new `CompactionJob` is created from the last job attempt, or a fresh one if it is the first `CompactionJob`. The `CompactionJob` is then handed to the CompactionExecutor for execution.
9. We need to update the CompactionExecutor code to support the following:
   - Resuming partially executed compactions (covered separately in the section below).
   - Writing compaction_state updates to the .compactor file. The CompactionExecutor would persist the compaction_state in the .compactor file by updating the `compactions` param (refer to the state management protocol). Two possible options:
     - Each CompactionExecutor job tries writing to the .compactor file itself.
     - The job writes the updated compaction_state to a blocking channel that is listened to and acted on by the Compaction Event Handler. We can leverage the `WorkerToOrchestratorMsg` enum with a oneshot ack so the CompactionJob blocks on the write.
     We have agreed on the second approach (channel-based updates via the compaction event handler); see the sketch after this list.
10. Once the CompactionJob is completed, follow the steps mentioned in the State Management protocol.
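To make the agreed channel-based approach concrete, here is a minimal sketch using tokio `mpsc` + `oneshot`. The message and function names are illustrative, not the actual `WorkerToOrchestratorMsg` variants; the real integration would reuse the existing worker/orchestrator channel.

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative message: the executor asks the event handler to persist the
// serialized compaction state and waits for an acknowledgement.
enum StateUpdateMsg {
    PersistCompactionState {
        serialized_state: Vec<u8>,
        ack: oneshot::Sender<Result<(), String>>,
    },
}

async fn persist_via_event_handler(
    tx: &mpsc::Sender<StateUpdateMsg>,
    serialized_state: Vec<u8>,
) -> Result<(), String> {
    let (ack_tx, ack_rx) = oneshot::channel();
    tx.send(StateUpdateMsg::PersistCompactionState {
        serialized_state,
        ack: ack_tx,
    })
    .await
    .map_err(|e| e.to_string())?;
    // The compaction job blocks here until the event handler confirms that the
    // .compactor file write succeeded (or reports an error).
    ack_rx.await.map_err(|e| e.to_string())?
}
```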
Persisting External Compactions
We need a mechanism to plug in external requests so that they can be picked up and executed by the compaction workflow. The idea is to leverage the existing compaction workflow. The steps are outlined below:

1. The client provides the list of source_ssts and source_srs to be compacted through a `submit_manual_compaction` method in `Admin` (see the Administrative Commands section below).
2. Use `pick_next_compaction` to transform the request into a list of compactions.
3. For each compaction in the input list, compactor_state.rs executes its own `submit_compaction` method, which performs the following validations against the compaction_state (see the sketch after this list):
   - Check that the count of running compactions is less than the threshold. If yes, continue.
   - Check that the source L0 SSTs and SRs are not part of any other compaction. If yes, continue.
   - Check that the destination SR is not part of any other compaction. If yes, continue.
   - Add compaction validations to verify that the correct group of sources and destinations is selected. Reference: https://github.com/slatedb/slatedb/blob/main/rfcs/0002-compaction.md#compactions.
   - The existing validations in the `submit_compaction` method.
   Note: Invalid compactions are dropped from the list of compactions during validation.
4. When a compaction validates successfully, its status is updated to `submitted` in the compaction_state and it is added/updated in the `new_compactions` list.
5. Try writing the compactor_state to the next sequential .compactor file. If the file exists:
   - If the latest .compactor compactor_epoch > current compactor epoch, die (fenced).
   - If the latest .compactor compactor_epoch == current compactor epoch, reconcile the compactor_state and pass the `new_compactions` object to step (3). This process continues until the .compactor file write succeeds or the `new_compactions` object is empty. (This would happen when the compactor has written a .compactor file.)
   - If the latest .compactor compactor_epoch < current compactor epoch, panic (`compactor_epoch` went backwards).
   (When this step succeeds, the compaction is persisted in the .compactor file.)

Note: The validations added in this protocol are best effort. The authority to validate a compaction lies with the compactor daemon thread. For more details, refer to https://github.com/slatedb/slatedb/pull/695#discussion_r2289989866.
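For illustration, the best-effort checks in step (3) boil down to a few set-membership tests against the in-memory compaction state. The sketch below uses simplified identifier types (`u32` for sources and destinations) and is not the actual compactor_state.rs code:

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of the validations in step (3): reject a candidate compaction if the
/// compactor is at its concurrency limit, or if its sources/destination overlap
/// with any compaction already tracked in the compaction state.
fn validate_candidate(
    running: usize,
    max_running: usize,
    sources: &HashSet<u32>,
    destination: u32,
    // compaction id -> (sources, destination) of already-tracked compactions
    active: &HashMap<u64, (HashSet<u32>, u32)>,
) -> bool {
    if running >= max_running {
        return false;
    }
    for (active_sources, active_destination) in active.values() {
        // Source L0 SSTs / SRs must not already be consumed elsewhere.
        if !active_sources.is_disjoint(sources) {
            return false;
        }
        // The destination SR must not be involved in another compaction.
        if *active_destination == destination || active_sources.contains(&destination) {
            return false;
        }
    }
    true
}
```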
Resuming Partial Compactions
1. When the output SSTs (part of the partially completed destination SR) are fetched, pick the lastEntry (the last entry in lexicographic order) from the last SST of the SR. Possible approaches:
   - Add a lastKey to the metadata block of the SST, similar to the first key, and fetch it from the metadata block, as suggested in https://github.com/slatedb/slatedb/pull/695/files#r2243447106.
   - Once on the relevant SST, go to the last block by iterating the indexes, then iterate to the lastKey of the last block of the SST. (This is the accepted approach we'll implement.)
2. Ignore completed L0 SSTs and move the iterator on each SR to a key >= lastKey of the SST partition.
3. This is done by doing a binary search on an SR to find the right SST partition and then iterating the blocks of that SST until we find the entry (see the sketch below). [Note: a corner case: with monotonically overlapping SST ranges (specifically the last key), a key might be present across a contiguous range of SSTs in an SR.]
4. Each `{key, seq_number, sst_iterator}` tuple is then added to a min_heap to decide the right order across a group of SRs (this is a way to get a sorted stream from all the sorted SR SSTs).
5. Once the above is constructed, the compaction logic continues to create output SSTs of 256MB with 4KB blocks each and persists them to the .compactor file by updating the compaction in `compactions`. (This is the `CompactionJob` progress section in the State Management Protocol.)

Note:
- Steps (3) and (4) are already implemented in `seek()` in merge_iterator; it should handle tombstones and TTL/expiration.
- Ensure the CLI requests are executed on the active Compactor process.
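As an illustration of the binary search in step (3), locating the SST partition within a sorted run that may contain the resume key can use `partition_point` over the SSTs' first keys. This is a simplified sketch (keys as byte vectors, SSTs represented only by their first key), not the actual SR/SST types:

```rust
/// Sketch: index of the SST within a sorted run whose range may contain
/// `resume_key`, given the first key of each SST in key order.
fn find_resume_sst(sst_first_keys: &[Vec<u8>], resume_key: &[u8]) -> usize {
    // First SST whose first_key is strictly greater than the resume key...
    let idx = sst_first_keys.partition_point(|first| first.as_slice() <= resume_key);
    // ...so the SST before it is the last one that could contain the key.
    idx.saturating_sub(1)
}
```

The block-level scan inside the chosen SST, and the corner case where overlapping last keys span a contiguous range of SSTs, are handled by the existing `seek()` logic noted above.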
Key Design Decisions
1. Persistence Boundaries
Decision: Persist state at the critical boundary:
- Output SST Completion: Every ~256MB of written data (always persisted)
Rationale: Output SST completions provide the best recovery value per persistence operation. Each represents significant completed work that we don’t want to lose.
2. Enhanced Job Model
Decision: Change from 1:1 to 1:N relationship between Compaction and CompactionJob.
Rationale: Enables retry logic, progress tracking, and recovery without breaking existing compaction scheduling logic.
3. State Management Pattern
Decision: Mirror the existing `ManifestStore` pattern with a `CompactorStore`.
Rationale: Reuses proven patterns for atomic updates, version checking, and conflict resolution that are already battle-tested in SlateDB.
4. Recovery Strategy
- Resume from last completed output SST
The section below is under discussion here: https://github.com/slatedb/slatedb/pull/695/files#r2239561471
Persistent State Storage
Object Store Layout
The compaction state is persisted to the object store following the same CAS pattern as manifests, ensuring consistency and reliability:
```
/000000001.compactor   # First compactor state
/000000002.compactor   # Updated state after compactions
/000000003.compactor   # Current state
```
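As a sketch of the write path for this layout, assuming a recent version of the `object_store` crate and its `PutMode::Create` conditional write; the actual store abstraction proposed here (a `CompactorStore` mirroring `ManifestStore`) may wrap this differently:

```rust
use object_store::{path::Path, Error as ObjectStoreError, ObjectStore, PutMode, PutOptions, PutPayload};

/// Try to create the next sequential .compactor file. Returns Ok(false) if that
/// version already exists, in which case the caller reconciles and retries.
async fn write_next_compactor_file(
    store: &dyn ObjectStore,
    next_id: u64,
    bytes: Vec<u8>,
) -> Result<bool, ObjectStoreError> {
    let path = Path::from(format!("{:09}.compactor", next_id));
    let mut opts = PutOptions::default();
    opts.mode = PutMode::Create; // fail instead of overwriting an existing version
    match store.put_opts(&path, PutPayload::from(bytes), opts).await {
        Ok(_) => Ok(true),
        Err(ObjectStoreError::AlreadyExists { .. }) => Ok(false),
        Err(e) => Err(e),
    }
}
```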
NOTE: The section below is discussed in detail here.
Protocol for State Management of Manifest and CompactionState
This is a proposal for state management of the Manifest and the CompactionState. The protocol is based on the following principles:

1. The manifest should be the source of truth for reader and writer clients (and should thus contain SRs).
2. The .compactor file should be an implementation detail of the compactor, not something the DB needs to pay any attention to.

With this view, the entire .compactor file is a single implementation detail for our built-in (non-distributed) compactor. In fact, for distributed compaction, it need not matter at all. The core of our compaction protocol is simply: the owner of the compactor_epoch may manipulate the SRs and L0s in the manifest.

The .compactor file would serve as an interface over which clients can build their custom distributed compactions, say based on etcd (Kubernetes), chitchat, `object_store`, etc.
They can have separate files for any approach specific to their state persistence needs.
On startup…
1. The compactor fetches the latest .manifest file (00005.manifest).
2. The compactor fetches the latest .compactor file (00005.compactor).
3. The compactor increments `compactor_epoch` in the manifest and tries writing it to the next .manifest file (00006.manifest). File version check (in-memory and remote object store): if 00006.manifest exists,
   - If the latest .manifest compactor_epoch > current compactor epoch, die (fenced).
   - If the latest .manifest compactor_epoch == current compactor epoch, die (fenced).
   - If the latest .manifest compactor_epoch < current compactor epoch, increment the .manifest file ID by 1 and retry. This process continues until the compactor write succeeds. (The current active compactor has updated the .manifest file.)
4. Try writing the above `compactor_epoch` in the dirty CompactionState to the next sequential .compactor position (00006.compactor). File version check (in-memory and remote object store): if 00006.compactor exists,
   - If the latest .compactor compactor_epoch > current compactor epoch, die (fenced).
   - If the latest .compactor compactor_epoch == current compactor epoch, panic.
   - If the latest .compactor compactor_epoch < current compactor epoch, increment the .compactor file ID by 1 and retry. This process continues until the compactor write succeeds. (The current active compactor job would have updated the .compactor file.)
5. If the compactor_epoch in the in-memory manifest (00005.manifest) >= `compactor_epoch`, older compactors are fenced now.

At this point, the compactor has been successfully initialised. Any attempt by a stale compactor to write a new .compactor file (00006.compactor) or .manifest file (00006.manifest) would fence it.
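The epoch comparison used on every conditional write above (and in the later protocol steps) can be factored into a single decision; the caller then maps the outcome to the step-specific action (die, panic, reconcile, or bump the file ID and retry). A minimal sketch:

```rust
/// Outcome of comparing the compactor_epoch found in the stored file against
/// the local compactor's epoch.
enum EpochCheck {
    Fenced,      // stored epoch is newer: another compactor has taken over
    SameEpoch,   // stored epoch equals ours: reconcile, die, or panic, per step
    StaleStored, // stored epoch is older: bump the file ID and retry the write
}

fn check_epoch(stored_epoch: u64, local_epoch: u64) -> EpochCheck {
    if stored_epoch > local_epoch {
        EpochCheck::Fenced
    } else if stored_epoch == local_epoch {
        EpochCheck::SameEpoch
    } else {
        EpochCheck::StaleStored
    }
}
```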
On compaction job creation…
(The manifest is polled periodically to get the list of L0 SSTs created. The scheduler would create a list of new compactions for these L0 SSTs as well.)

1. The compactor fetches the latest .manifest file during the manifest poll (00006.manifest).
2. The compactor writes the list of scheduled compactions with empty JobAttempts to the next .compactor file (00007.compactor in our example). If the file (00007.compactor) exists,
   - If the latest .compactor compactor_epoch > current compactor epoch, die (fenced).
   - If the latest .compactor compactor_epoch == current compactor epoch, reconcile the compactor state and go to step (2).
   - If the latest .compactor compactor_epoch < current compactor epoch, panic (`compactor_epoch` went backwards).
On compaction job progress…
1. The compactor writes the compactionState with the latest progress to the next .compactor file (persisted whenever an output SST is added to the SR) (00008.compactor in our example).
2. If the file (00008.compactor) exists,
   - If the latest .compactor compactor_epoch > current compactor epoch, die (fenced).
   - If the latest .compactor compactor_epoch == current compactor epoch, reconcile the compactor state and retry from step (1).
   - If the latest .compactor compactor_epoch < current compactor epoch, panic (`compactor_epoch` went backwards).
On compaction job complete…
1. Write the current compactor state (including the completed compaction job) to the next sequential .compactor file (00009.compactor) (steps (1) and (2) in the "progress" section above).
2. Update the in-memory .manifest state (fetched in the compaction initiation phase) with the compaction state to reflect the latest SRs/SSTs that were created (and remove old SRs/SSTs).
3. Write the in-memory .manifest state to the next sequential .manifest file. If the file (00007.manifest) exists, it could be due to three possibilities:
   - The writer has written a new manifest.
   - A reader has written a new manifest with new checkpoints.
   - The compactor has written a new manifest.
   In any case,
   - If the latest .manifest compactor_epoch > current manifest epoch, die (fenced).
   - If the latest .manifest compactor_epoch == current manifest epoch, reconcile with the latest manifest and write to the next sequential .manifest file (00008.manifest). (The writer could have flushed L0 SSTs or updated a checkpoint in the manifest.)
   - If the latest .manifest compactor_epoch < current manifest epoch, panic (`compactor_epoch` went backwards).
Summarised Protocol
1. Compactor A fetches the latest .manifest file (00001.manifest)
2. Compactor A fetches the latest .compactor file (00001.compactor)
3. Compactor A writes .compactor file (00002.compactor) with compactor_epoch (compactor_epoch = 1)
4. If compactor_epoch in .manifest file (00001.manifest) >= compactor_epoch (compactor_epoch = 1), fenced
5. Compactor A writes .manifest file (00002.manifest) with compactor_epoch (compactor_epoch = 1)
(Compactions are scheduled based on the latest manifest poll, and the CompactionJob updates the .compactor file with the in-progress SR state.)
After CompactionJob completion:
6. Update the in-memory .manifest (00002.manifest) state to reflect the latest SRs/SSTs that were created (and remove old SRs/SSTs) from the latest .compactor file.
7. Write the in-memory .manifest state to the next sequential .manifest file.
8. If the file (00007.manifest) exists, it could be due to three possibilities:
   - The writer has written a new manifest.
   - A reader has written a new manifest.
   - The compactor has written a new manifest.
In any case, compare the `compactor_epoch` in the stored .manifest file with the local manifest, and write the manifest to the next sequential file if the compactor is not fenced.
Race conditions handled in the protocol
Incorrect read order of manifest and compactionState
```
Compactor 1 reads .compactor   (compactor_epoch=1, [SR0, SR1, SR2])
Compactor 2 updates .compactor (compactor_epoch=2, [SR0, SR1, SR2])
Compactor 2 updates .compactor (compactor_epoch=2, [SR2])
Compactor 2 updates .manifest  (compactor_epoch=2, [SR2])
Compactor 1 reads .manifest    ([SR2])
Compactor 1 writes .manifest   ([SR1, SR2])   // undoes Compactor 2's change when it should be fenced
```
Fenced Compactor Process trying to update manifest
```
.manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
.compactor file: [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]

At T = 0, Compactor A starts (compactor_epoch = 1) and creates a sequential .compactor file
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]

At T = 1, Compactor A (compactor_epoch = 1) creates a sequential .compactor file
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR4(merged), SR3, SR2, SR1, SR0]

At T = 3, Compactor B starts (compactor_epoch = 2) and creates a sequential .compactor file (Compactor A is fenced)
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR4(merged), SR3, SR2, SR1, SR0]

At T = 4, Compactor B (compactor_epoch = 2) creates a sequential .compactor file
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR4(merged), SR0(merged)]

At T = 5, Compactor B creates a sequential .manifest file

At T = 6, Compactor A creates a sequential .manifest file (fenced compactor updating the manifest)
```
Note: The protocol still allows a fenced compactor to update the manifest if its writes are in order, because the compactor is always syncing compaction state. However, it would get fenced if the file already exists. Consider the following case:
Fenced Compactor Process trying to update manifest
```
.manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
.compactor file: [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]

At T = 0, Compactor A starts (compactor_epoch = 1)
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]

At T = 1, Compactor A (compactor_epoch = 1) creates a sequential .compactor file
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR4(merged), SR3, SR2, SR1, SR0]

At T = 3, Compactor B starts (compactor_epoch = 2) (Compactor A is fenced)
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR4(merged), SR3, SR2, SR1, SR0]

At T = 4, Compactor B (compactor_epoch = 2) updates the .compactor file
  .manifest file : [SR7, SR6, SR5, SR4, SR3, SR2, SR1, SR0]
  .compactor file: [SR4(merged), SR0(merged)]

At T = 5, Compactor A updates the .manifest file   [Compactor A is fenced but can still update the manifest]

At T = 6, Compactor B updates the .manifest file
```
Gaps in compactor_epoch in .compactor file
```
Compactor 1 reads latest .manifest file  (00005.manifest,  compactor_epoch = 1)
Compactor 1 reads latest .compactor file (00005.compactor, compactor_epoch = 1)
Compactor 1 writes .manifest file        (00006.manifest,  compactor_epoch = 2)
Compactor 2 reads latest .manifest file  (00006.manifest,  compactor_epoch = 2)
Compactor 2 reads latest .compactor file (00005.compactor, compactor_epoch = 1)
Compactor 2 writes .manifest file        (00007.manifest,  compactor_epoch = 3)
Compactor 2 writes .compactor file       (00006.compactor, compactor_epoch = 3)
Compactor 1 writes .manifest file        (00006.manifest,  compactor_epoch = 2)   (fenced)
```
Note: The above protocol enables us to use the existing compaction logic for merging L0 SSTs/SRs between manifest and compactionState. Hence, that is not added as part of this protocol.
External Process Integration
Administrative Commands:
- `slatedb compaction submit --sources SR1,SR2` - Submit a manual compaction
- `slatedb compaction status --id <compaction-id>` - Status of a compaction
We leverage `admin.rs` to expose methods that are triggered during manual compaction requests from an external process / CLI:
```rust
// API method signatures use the public structs directly
pub async fn submit_manual_compaction(
    &self,
    source_ssts: Vec<String>,
    source_srs: Vec<String>,
) -> Result<CompactionInfo, Error>;

pub async fn get_compaction_info(&self, id: String) -> Result<CompactionInfo, Error>;

/// Status of a compaction job to be shown to the customer
#[derive(Debug, Clone, PartialEq)]
pub enum CompactionStatusResponse {
    Submitted,  // Waiting to be scheduled
    InProgress, // Currently executing
    Completed,  // Successfully finished
    Failed,     // Failed with error
}

/// Progress information for an active compaction
#[derive(Debug, Clone)]
pub struct CompactionProgressResponse {
    /// Number of input SSTs processed so far
    pub input_ssts_processed: usize,
    /// Total number of input SSTs to process
    pub total_input_ssts: usize,
    /// Number of output SSTs written
    pub output_ssts_written: usize,
    /// Total bytes processed from input
    pub bytes_processed: u64,
    /// Completion percentage (0.0 to 100.0)
    pub completion_percentage: f64,
    /// Estimated completion time
    pub estimated_completion: Option<DateTime<Utc>>,
}

/// Detailed information about a compaction
#[derive(Debug, Clone)]
pub struct CompactionInfo {
    /// Unique identifier for the compaction
    pub id: String,
    /// Current status
    pub status: CompactionStatusResponse,
    /// Source SSTs being compacted
    pub source_ssts: Vec<String>,
    /// Source SRs being compacted
    pub source_srs: Vec<String>,
    /// Target destination of the compaction
    pub target: String,
    /// Current progress (if running)
    pub progress: Option<CompactionProgressResponse>,
    /// When the compaction was created
    pub created_at: DateTime<Utc>,
    /// When the compaction started (if applicable)
    pub started_at: Option<DateTime<Utc>>,
    /// When the compaction completed (if applicable)
    pub completed_at: Option<DateTime<Utc>>,
    /// Error message (if failed)
    pub error_message: Option<String>,
}
```
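A hedged usage sketch of these methods from an external process (assuming an `Admin` handle named `admin` and an async context; error handling is abbreviated):

```rust
// Submit a manual compaction for two sorted runs and then poll its status.
let info = admin
    .submit_manual_compaction(Vec::new(), vec!["SR1".to_string(), "SR2".to_string()])
    .await?;
println!("submitted compaction {}", info.id);

let latest = admin.get_compaction_info(info.id.clone()).await?;
if let Some(progress) = latest.progress {
    println!("{:.1}% complete", progress.completion_percentage);
}
```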
Garbage Collection Integration
The garbage collection of the .compactor file can leverage the existing logic for garbage collecting .sst and .manifest files.
An .sst file is eligible for garbage collection if it satisfies the following conditions:
- The SST is older than the configured min age.
- The SST is not referenced by any active manifest checkpoint.
Note: We would also need to handle the scenario mentioned in https://github.com/slatedb/slatedb/issues/604 to avoid deletion of compacted SSTs and prevent data corruption.
A .manifest file is eligible for garbage collection under the following conditions:
- It is not the latest manifest (the latest manifest is never deleted).
- It is not referenced by any active checkpoint.
- It has passed the min_age.
The .compactor file can be cleaned up by the garbage collector under conditions similar to .manifest garbage collection:
- It is not the latest .compactor file (the latest .compactor file is never deleted).
- It has passed the min_age.
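As a sketch (not the actual GC code) of the .compactor rule above, using only the file's age and whether it is the latest version:

```rust
use std::time::{Duration, SystemTime};

/// Sketch: a .compactor file may be deleted only if it is not the latest one
/// and it is older than the configured minimum age.
fn is_compactor_file_collectable(
    file_id: u64,
    latest_id: u64,
    last_modified: SystemTime,
    min_age: Duration,
    now: SystemTime,
) -> bool {
    let old_enough = now
        .duration_since(last_modified)
        .map(|age| age >= min_age)
        .unwrap_or(false);
    file_id != latest_id && old_enough
}
```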
Observability Enhancements
Progress Tracking
- Real-time progress: Bytes processed, SSTs completed, estimated completion time
- Phase tracking: Reading inputs → Writing outputs → Updating manifest → Completed
- Recovery metrics: Work preservation percentage, recovery time
Statistics
- Performance: Throughput, duration, success rates by compaction size
- Recovery: Jobs recovered, average recovery time, work preservation
- Errors: Categorized by type (network, memory, corruption) for retry decisions
- Cost: Persistence operations, overhead percentage
Cost Analysis
Operation Count Breakdown
For a typical 40GB compaction (160 input SSTs → ~160 output SSTs):
Baseline compaction operations:
- 160 SST reads: Reading input SST files
- 160 SST writes: Writing output SST files
- 2 manifest updates: Initial job start + final completion
- Total baseline: 322 operations
With persistence enabled:
- 160 SST reads: Input SST files (unchanged)
- 160 SST writes: Output SST files (unchanged)
- 160 state writes: Persistence after each output SST (~256MB intervals)
- 2 manifest updates: Job lifecycle management
- Total with persistence: 482 operations
Overhead calculation:
- Additional operations: 160 (482 - 322)
- Percentage increase: +50% operations
- Operations per GB: ~4.0 additional ops/GB (160 ÷ 40GB)
Cloud Cost Analysis
Using AWS S3 Standard pricing:
- PUT operations: $0.0005 per 1,000 requests
- GET operations: $0.0004 per 1,000 requests
- DELETE operations: $0.0005 per 1,000 requests
Baseline costs:
- 160 PUTs (SST writes): $0.000080
- 160 GETs (SST reads): $0.000064
- 2 PUTs (manifest): $0.000001
- Total baseline: $0.000145
Additional persistence costs:
- 160 PUTs (state writes): $0.000080
- Additional total: $0.000080
Cost impact:
- Additional cost: $0.000080 (~$0.00008)
- Percentage increase: +50% operations, but negligible absolute cost
- Cost per GB: ~$0.000002 per GB compacted
Recovery Efficiency Analysis
Work Preservation Calculation
Without persistence:
- Recovery strategy: Restart compaction from beginning
- Work preserved: 0% (all progress lost)
- Additional operations: Full re-execution (322 operations repeated)
- Time impact: 100% of original compaction time
With persistence:
- Average failure point: 50% through compaction (statistical)
- Work preserved: ~50% of progress maintained
- Recovery operations: Resume from last checkpoint
- Time impact: ~50% of original compaction time saved
Detailed recovery scenarios:
| Failure Point | Without Persistence | With Persistence | Work Preserved | Operations Saved (work preserved % × 322) |
|---|---|---|---|---|
| 10% complete | Restart (0% preserved) | Resume from 10% | 10% | 32 operations |
| 25% complete | Restart (0% preserved) | Resume from 25% | 25% | 81 operations |
| 50% complete | Restart (0% preserved) | Resume from 50% | 50% | 161 operations |
| 75% complete | Restart (0% preserved) | Resume from 75% | 75% | 242 operations |
| 90% complete | Restart (0% preserved) | Resume from 90% | 90% | 290 operations |
Average work preservation: 50% across all failure scenarios
Scaling Analysis

| Compaction Size | SSTs | Base Ops | +Persistence | Additional Cost | % Overhead |
|---|---|---|---|---|---|
| 10GB (40 SSTs) | 40 | 82 | 122 | $0.000020 | +49% |
| 40GB (160 SSTs) | 160 | 322 | 482 | $0.000080 | +50% |
| 100GB (400 SSTs) | 400 | 802 | 1,202 | $0.000200 | +50% |
| 1TB (4,000 SSTs) | 4,000 | 8,002 | 12,002 | $0.002000 | +50% |
Key observations:
- Operations increase by ~50%, but absolute costs remain minimal
- Cost scales linearly with compaction size (~$0.000002/GB)
- Percentage overhead is consistent at ~50% across all sizes
- Total costs are negligible compared to storage and compute costs
Future Extensions
- Persistent state provides foundation for multi-compactor coordination and work distribution.
- Define a minimum time boundary between compaction file updates to prevent excessive writes to the file (see https://github.com/slatedb/slatedb/pull/695#discussion_r2229977189)
- Add last_key to SST metadata to enable efficient range-based SST filtering during compaction source selection and range query execution.