# Synchronous Commit & Durability
Status: Accepted
Authors:
## Background
The discussion of commit semantics and durability began in the comments of <https://github.com/slatedb/slatedb/pull/260#issuecomment-2570658593>. As the discussion evolved, it became clear that this is a complex topic. The semantics of commit and durability involve many intricate details, with subtle but important differences between various approaches.
We’re creating this RFC to facilitate a thorough discussion of these concepts in the code review board.
This RFC aims to:
- Define clear synchronous commit semantics and durability guarantees that SlateDB users can rely on with confidence.
- Create a well-defined API allowing users to specify their commit semantics and durability requirements.
- Account for tiered WAL in the design.
- Plan and organize the necessary code changes to implement these features.
As previously discussed in meetings and comments, we intend to implement these changes in an additive way that preserves all existing capabilities.
## References
### Other Systems
Let's examine how other systems handle synchronous commits and durability guarantees before diving into the design. We'll focus on these key aspects:
- The API for users to specify their commit semantics & durability requirements
- Use cases & trade-offs, and the default settings
- Error handling
#### PostgreSQL
PostgreSQL provides a flexible setting called `synchronous_commit` that controls transaction durability and commit behavior. It offers several levels:

- `off`: Commits complete immediately after the transaction finishes, without waiting for the WAL to be written to disk. This means data loss is possible if a crash occurs.
- `local`: Commits wait for the WAL to be written and flushed to the local disk before returning.
- `on` (default): Commits wait for the WAL to be written and flushed locally; if synchronous replication is configured, they also wait for at least one standby server to confirm it has received and flushed the WAL to durable storage.
- `remote_write`: Commits wait for the WAL to be written and flushed locally, then wait for standbys to confirm they have received the WAL and written it out to their file systems (not necessarily flushed to durable storage).
- `remote_apply`: Commits wait for the WAL to be written and flushed locally, then wait for standbys to fully apply the WAL. This ensures the data is consistent between the primary and standbys.
The article “Understanding synchronous_commit in PostgreSQL” includes a helpful diagram showing how the `on` and `off` settings work (converted into a mermaid sequence diagram by @criccomini with ChatGPT):

```mermaid
sequenceDiagram
    participant Client
    participant ClientBackend
    participant SharedBuffer
    participant WALBuffer
    participant WALWriter
    participant Disk
    Client ->> ClientBackend: Execute DML
    ClientBackend ->> SharedBuffer: Write Data
    ClientBackend ->> WALBuffer: Write WAL
    alt synchronous_commit is off
        ClientBackend -->> Client: Return Success (if off)
        WALBuffer ->> WALWriter: Asynchronous WAL write
        WALWriter ->> Disk: Asynchronous Write
    else synchronous_commit is on
        WALBuffer ->> WALWriter: Wait for WALWriter
        WALWriter ->> Disk: Flush WAL
        ClientBackend -->> Client: Return Success (if on)
    end
```
The key distinction between `on` and `off` is that `on` ensures the WAL is written and flushed to local storage before proceeding. This highlights a central theme throughout this RFC: synchronous commit and durability fundamentally revolve around WAL handling.
Let's summarize the use cases and trade-offs for different `synchronous_commit` levels:

- For financial systems where data is highly sensitive and data loss is unacceptable, `on`, `remote_write`, or `remote_apply` should be used.
- For mission-critical systems that cannot tolerate data inconsistency between primary and standby servers after a failover, `remote_apply` should be used to guarantee data consistency across servers.
- For workloads like logging or stream processing where some data loss is acceptable and performance is paramount, `off` or `local` can be used to optimize throughput.
#### RocksDB
RocksDB describes synchronous commit in their documentation as follows:
> #### Non-Sync Mode
>
> When WriteOptions.sync = false (the default), WAL writes are not synchronized to disk. Unless the operating system thinks it must flush the data (e.g. too many dirty pages), users don’t need to wait for any I/O for write.
>
> Users who want to even reduce the CPU of latency introduced by writing to OS page cache, can choose Options.manual_wal_flush = true. With this option, WAL writes are not even flushed to the file system page cache, but kept in RocksDB. Users need to call DB::FlushWAL() to have buffered entries go to the file system.
>
> Users can call DB::SyncWAL() to force fsync WAL files. The function will not block writes being executed in other threads.
>
> In this mode, the WAL write is not crash safe.
>
> #### Sync Mode
>
> When WriteOptions.sync = true, the WAL file is fsync’ed before returning to the user.
>
> #### Group Commit
>
> As most other systems relying on logs, RocksDB supports group commit to improve WAL writing throughput, as well as write amplification. RocksDB’s group commit is implemented in a naive way: when different threads are writing to the same DB at the same time, all outstanding writes that qualify to be combined will be combined together and write to WAL once, with one fsync. In this way, more writes can be completed by the same number of I/Os.
>
> Writes with different write options might disqualify themselves to be combined. The maximum group size is 1MB. RocksDB won’t try to increase batch size by proactive delaying the writes.
Like PostgreSQL, RocksDB provides a `sync` option to control commit behavior and durability guarantees. When `sync = true`, a write is not considered committed until the data is `fsync()`ed to storage.
When `sync = false`, a write is considered committed immediately after the transaction completes, without waiting for the WAL to be flushed to disk. While the WAL is still buffered in the kernel's page cache, data loss can occur if a crash happens.
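For reference, here is how the two modes look through the rust-rocksdb bindings; this is only an illustrative sketch, and the exact crate API may vary between versions:

```rust
use rocksdb::{WriteOptions, DB};

fn rocksdb_sync_example(db: &DB) -> Result<(), rocksdb::Error> {
    // Non-sync mode: the WAL write lands in the OS page cache only,
    // so the write is fast but not crash safe.
    let mut non_sync = WriteOptions::default();
    non_sync.set_sync(false);
    db.put_opt(b"key001", b"value001", &non_sync)?;

    // Sync mode: the WAL file is fsync'ed before put_opt returns.
    let mut sync = WriteOptions::default();
    sync.set_sync(true);
    db.put_opt(b"key002", b"value002", &sync)?;

    Ok(())
}
```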
Unlike PostgreSQL's `synchronous_commit`, which offers multiple levels, RocksDB only provides a simple boolean option. This is because RocksDB is an embedded database and doesn't have the primary/standby architecture that PostgreSQL has.
To optimize synchronous commit performance, RocksDB implements Group Commit, which is a common pattern in WAL-based systems. This mechanism batches multiple writes together into a single, larger WAL write and flush operation, which significantly improves I/O throughput compared to multiple small writes.
(SlateDB implemented a similar Group Commit mechanism through its Commit Pipeline, where multiple writes with `await_durable: true` are batched into a single WAL write after `flush.interval` seconds or when the WAL buffer is full.)
It's worth noting that RocksDB defaults to `sync = false`, meaning WAL writes are not crash-safe by default.
This default is likely a performance trade-off. In many distributed systems (RocksDB's primary use case), some data loss on individual nodes is acceptable without compromising overall system durability. Examples include Raft clusters, distributed key-value stores, and stream processing state stores. For these use cases, setting `sync = false` or `manual_wal_flush = true` can be beneficial for performance.
RocksDB allows mixing writes with different sync settings. For example, if transaction A commits with `sync = false` and transaction B starts afterwards, transaction A's writes will be visible to readers in transaction B. When transaction B commits with `sync = true`, both transactions' writes are persisted. This ordering guarantee means that when a `sync = true` write commits, all previous writes are guaranteed to be persisted as well.
Another important consideration is WAL write failures. RocksDB handles these by retrying writes until determining whether the failure is temporary or permanent. In cases of permanent failure (like a full or corrupted disk), RocksDB marks the database state as fatal, rolls back the transaction, and switches the database instance to read-only mode.
### Synchronous Commit in Summary
Based on the PostgreSQL and RocksDB references above, we can summarize the key semantics of Synchronous Commit:
- A write is only considered committed once the WAL has been persisted to storage. Until then, the data remains invisible to readers.
- If there is a permanent failure while persisting the WAL during a Synchronous Commit, the transaction rolls back. The database instance enters a fatal state and switches to read-only mode.
- It’s possible to have multiple levels of Synchronous Commit, or even disable it, allowing users to balance performance and durability requirements.
- Synchronous and asynchronous (non-sync) commits can be interleaved across transactions. A transaction using Synchronous Commit can read writes from transactions that used asynchronous commit. When a Synchronous Commit persists, it also persists any earlier asynchronous-commit writes in the WAL.
## Current Design in SlateDB
This section is based on @criccomini's comment in <https://github.com/slatedb/slatedb/pull/260#issuecomment-2576502212>.
SlateDB currently does not provide an explicit notion of Synchronous Commit. However, it does provide a `DurabilityLevel` enum to control the durability guarantees on both read and write operations.

The `DurabilityLevel` enum is defined as follows:

```rust
enum DurabilityLevel {
    Memory,
    Local, // not implemented yet
    Remote,
}
```
The `WriteOptions` struct contains an `await_durability: DurabilityLevel` option to control the waiting behavior for durability. If `await_durability` is set to `DurabilityLevel::Remote`, the write will wait for the WAL to be written into S3 before returning.

It's important to note that SlateDB's commit semantics differ from other systems' Synchronous Commit. Regardless of what `DurabilityLevel` is set on the write operation, the write is considered visible to readers with `DurabilityLevel::Memory` immediately after it is appended to the WAL, not necessarily flushed to storage.
This difference exists because SlateDB’s WAL serves not only for crash recovery but also for data reads. The read path first accesses the WAL, then MemTable, then L0 SST, and finally SSTs at deeper levels.
In traditional Synchronous Commit semantics, data is considered committed only after the write is persisted to WAL storage. Users can specify the durability level as `DurabilityLevel::Remote` for read calls to ensure only committed/persisted data is read.
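To make the current model concrete, here is an illustrative sketch. The builder method names (`with_await_durability`, `with_durability`, `get_with_options`) are hypothetical stand-ins for the options described above, not the definitive API:

```rust
// Illustrative only: the write does not return until its WAL entry reaches S3...
let write_opts = WriteOptions::new().with_await_durability(DurabilityLevel::Remote);
db.put_with_options("key001", "value001", write_opts).await?;

// ...but a reader using DurabilityLevel::Memory may already have seen "key001"
// as soon as it was appended to the in-memory WAL, before the flush completed.
let read_opts = ReadOptions::new().with_durability(DurabilityLevel::Memory);
let value = db.get_with_options("key001", read_opts).await?;
```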
SlateDB differs from both PostgreSQL and RocksDB in multiple ways:
- Unlike PostgreSQL, it’s not a distributed system with Primary/Standby nodes.
- Unlike RocksDB, it stores data in S3 rather than local disk, resulting in slower write operations and additional costs for API requests.
These differences lead to several key considerations:
- Group commit is essential. By batching multiple writes together, we can both reduce API costs and improve performance compared to many small writes. However, even with Group Commit, writes to S3 will still be slower than local disk writes.
- Writes will inherently take longer due to S3 latency. Given this reality, it makes sense to allow readers who can accept eventual consistency to access unpersisted and uncommitted data while waiting for writes to be durably committed.
- Writing WAL to S3 has a higher risk of permanent failures due to network instability compared to local disk. This makes it critical to implement robust auto-recovery mechanisms for handling I/O failures.[^1]
- We should provide a way to control the durability level of read operations, so users can choose to read only persisted data.
These unique characteristics of SlateDB must be carefully considered as we design our durability and commit semantics.
## Possible Improvements
Synchronous Commit is a critical feature for mission-critical systems. It guarantees full ACID compliance by ensuring writes remain invisible until they are committed to durable storage. It also allows for different levels of durability guarantees to balance various use cases and trade-offs.
However, when comparing SlateDB’s current model with PostgreSQL and RocksDB’s Synchronous Commit implementations, there are some challenges in replicating the same semantics.
For example, in a transaction intended to be synchronously committed, the write should not be considered committed until the data is flushed to storage. But in SlateDB's current model, the data becomes visible to readers accepting unpersisted data (using `DurabilityLevel::Memory`) as soon as the write is appended to the WAL - before it's actually persisted to storage. This means uncommitted data can potentially be read before it's durably committed.
If users want to avoid reading uncommitted data, they can use `DurabilityLevel::Remote` to ensure they only read persisted data. However, this approach has drawbacks within transactions. If other writers make unpersisted writes (`DurabilityLevel::Memory`) to the same keys that the transaction is accessing, it will constantly encounter conflicts and rollbacks.
In short:

- We cannot guarantee Synchronous Commit semantics by setting writers to `DurabilityLevel::Remote` and readers to `DurabilityLevel::Memory`, since uncommitted data may be visible before it's durably committed.
- Setting both writers and readers to `DurabilityLevel::Remote` also presents challenges, as transactions may frequently roll back due to conflicts with unpersisted writes made by other writers using `DurabilityLevel::Memory` on the same keys.
## Proposal
### Read Committed Data by Default
This proposal aims to provide users with true Synchronous Commit semantics while preserving all capabilities of the current model.
It’s important to note that “Commit” and “Durability” are distinct concepts that aren’t necessarily coupled. Data can be considered “Committed” even if it hasn’t been flushed to persistent storage - it may exist only in memory. While such data isn’t persisted, it can still be treated as safely committed from a transactional perspective.
Later transactions can safely depend on these “Unpersisted” but “Committed” data without worrying about conflicts, since they represent the latest committed state of the data. This allows for consistent transaction semantics even when some committed data hasn’t yet been persisted to storage.
“Committed” data won't contain data that is still in the process of being committed (which is possible in reads with `DurabilityLevel::Memory` in the current model).
By allowing users to read “Committed” data by default, we can address the challenges outlined earlier. Reading committed data, regardless of persistence status, enables proper transaction semantics while still allowing for performance optimizations through writes with lower durability requirements. This provides a cleaner separation between transaction consistency and durability guarantees.
Let’s assume we have a sequence of writes like this:
| seq | operation | marks |
|-----|-----------|-------|
| 100 | WRITE key001 = “value001” with (durability: Memory) | |
| 101 | WRITE key002 = “value002” with (durability: Memory) | |
| 102 | WRITE key003 = “value003” with (durability: Memory) | |
| 103 | WRITE key004 = “value004” with (durability: Remote) | <— last remote persisted seq |
| 104 | WRITE key005 = “value005” with (durability: Memory) | |
| 105 | WRITE key006 = “value006” with (durability: Memory) | <— last committed seq |
| 106 | WRITE key007 = “value007” with (durability: Remote), but not yet persisted | <— last committing seq |
While “Committed” is distinct from the “DurabilityLevel” concept, both the “Committed” and “Persisted” positions can be tracked using separate watermarks in the commit history of sequence numbers.
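As a minimal sketch (illustrative names, not actual SlateDB types), the two watermarks from the table above could be tracked and consulted like this:

```rust
/// Illustrative only: the two watermarks from the table above.
#[derive(Debug, Clone, Copy)]
struct Watermarks {
    /// Highest seq applied to the MemTable and therefore committed (105 above).
    last_committed_seq: u64,
    /// Highest seq whose WAL flush to object storage has been acked (103 above).
    last_remote_persisted_seq: u64,
}

impl Watermarks {
    /// The highest sequence number a read is allowed to observe, depending on
    /// whether the reader wants committed data (the default) or only data that
    /// is already persisted remotely.
    fn visible_seq(&self, persisted_only: bool) -> u64 {
        if persisted_only {
            self.last_remote_persisted_seq // e.g. 103: key004 and earlier
        } else {
            self.last_committed_seq // e.g. 105: key006 and earlier
        }
    }
}
```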
Reading “Committed” data should be considered the default behavior for read operations, as committed data is safe data.
In some cases, users may want to read only “Persisted” data, and they can use the `durability_filter` option to achieve this goal:

```rust
let opts = ReadOptions::new().with_durability_filter(DurabilityLevel::Local);
db.get(key, opts).await?;
```
Using this durability filter option, the read operation will only retrieve data that has been persisted to storage. This ensures the data read is durable at least at `DurabilityLevel::Local`, but it might not be the most up-to-date.
By default, regardless of the `DurabilityLevel` set in the `durability_filter` option, the read operation will always read Committed data. But in some use cases, such as disk caching, users may not care whether the data is committed, only that it is durable at least at `DurabilityLevel::Local`.
If the writer writes with `SyncLevel::Remote`, the data will not be considered committed until it's persisted to `Remote`. If the user wants to read the data as soon as it's persisted to `Local`, they can use an additional option, `dirty: true`, to make this possible:

```rust
let opts = ReadOptions::new()
    .with_durability_filter(DurabilityLevel::Local)
    .with_dirty(true);
db.get(key, opts).await?;
```
### Sync Commit
To implement Synchronous Commit semantics, the write side does not need to change much: the operation returns once the WAL is persisted according to the specified durability level.
In other systems, this option is often named `sync`, to emphasize the idea of synchronous commit:

```rust
let opts = WriteOptions::new().with_sync(DurabilityLevel::Remote);
db.write(key, value, opts).await?;
```
However, for writes that are not expected to be durably persisted, this is better thought of as disabling synchronous commit entirely rather than “synchronizing to memory”. The API might look like:
```rust
enum SyncLevel {
    Off,
    Local,
    Remote,
}
```

```rust
let opts = WriteOptions::new().with_sync(SyncLevel::Off);
db.write(key, value, opts).await?;
```
The benefits of having a `SyncLevel` enum are:

- The name “Synchronous Commit” directly describes the feature we aim to support.
- The term “sync” is commonly used across database systems and is well understood and familiar to users.
- The names are shorter and easier to type and read.
- It's more accurate to describe a memory-durability write as having sync commit “off” than as “syncing to memory”.
On the writer side, if `sync` is enabled, the write operation will wait until the data is persisted to storage according to the specified sync level. If `sync` is set to `SyncLevel::Off`, the write operation will not wait for its own data to be persisted remotely.
One thing is worth noting, though: commits are strictly ordered. A later write must wait for the previous write to commit before committing itself. As a result, a write with `SyncLevel::Off` does NOT mean the write becomes visible immediately.
For example:

- thread A: seq 100: write keyXXX with `sync: Remote`
- thread B: seq 101: write keyYYY with `sync: Off`

Write operation B will NOT return immediately. It is queued in the WAL buffer and waits for the previous write operation A to be committed; it is then applied to the MemTable, becomes visible to readers, and finally returns.
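Sketched with the proposed API (hypothetical helpers; the database handle is assumed to be cheaply cloneable across tasks), the two writers above might look like this:

```rust
// Hypothetical sketch of the ordering example above, using the proposed SyncLevel API.
let db_a = db.clone();
let writer_a = tokio::spawn(async move {
    // seq 100: returns only after its WAL entry is flushed to object storage.
    let opts = WriteOptions::new().with_sync(SyncLevel::Remote);
    db_a.put_with_options("keyXXX", "valueA", opts).await
});

let db_b = db.clone();
let writer_b = tokio::spawn(async move {
    // seq 101: needs no flush of its own, but commits are strictly ordered, so
    // it is queued behind seq 100 and only returns once it has been applied to
    // the MemTable, i.e. after A has committed.
    let opts = WriteOptions::new().with_sync(SyncLevel::Off);
    db_b.put_with_options("keyYYY", "valueB", opts).await
});

let (_res_a, _res_b) = tokio::join!(writer_a, writer_b);
```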
`commit()` always ensures “read your writes” consistency; as a result, it is still a blocking operation even with `sync: Off`. If we want a no-wait write, we can consider adding something like a `commit_with_callback(cb: Callback)` interface, like Badger did. This async commit interface is out of scope for this RFC; we can discuss it in a separate RFC or issue if needed.
### Difference between “await_durable” and “sync”
`await_durable` and `sync` appear to have similar behaviors - on the write side, both wait until data reaches a persistent state before returning.

However, there are subtle semantic differences between them:

- `sync` determines when data enters the committed state, and the write returns after the data is committed.
- `await_durable` only waits for the data's durability state.
For example, when a write has `SyncLevel::Off` set, it is considered to enter the Committed state immediately, and it becomes visible to readers at `LastCommitted` right away.
However, this data will still eventually be flushed to storage if the DB keeps running. If you want to subscribe to or wait for the point at which this `sync: Off` write becomes durably persisted, you can still use `await_durable` to achieve this.
Also, when using sync commit, later non-sync writes may end up queued in the WAL buffer, delaying when they are applied to the MemTable and become visible to readers.
Sometimes, users might want writes to become visible to readers immediately while still waiting for the data to be durably persisted. `await_durable` and `sync` can be used together to achieve this:

```rust
let opts = WriteOptions::new()
    .with_sync(SyncLevel::Off)
    // On the writer side, still wait until the write is persisted remotely.
    .with_await_durable(DurabilityLevel::Remote);
db.put_with_options("key", "val", opts).await?;
```
When the WAL is disabled (no-WAL mode), the `sync` option has no effect since there is no WAL to sync. In this case, users should use `await_durable` to ensure their writes are persisted durably.
While `SyncLevel` and `DurabilityLevel` may appear similar, they serve distinct purposes. `SyncLevel` controls when writes become visible to readers by determining the commit semantics, while `DurabilityLevel` focuses on the durability of the write.
## Implementation
The key change is to separate “Append to WAL” and “Commit to make it visible” into two different steps.
In the current design, whenever a write operation is called, the data is appended to the WAL and immediately becomes visible to readers with `DurabilityLevel::Memory`, since readers read from the WAL in front of the MemTable. When the WAL buffer reaches `flush.interval` or `flush.size`, it is flushed to storage and applied to the MemTable.
The sequence diagram is like this:

```mermaid
sequenceDiagram
    participant WriteBatch
    participant WAL
    participant MemTable
    participant Flusher
    WriteBatch->>WAL: append write op to current WAL
    Note over WAL: become visible to readers with DurabilityLevel::Memory
    WAL->>WAL: maybe trigger a flush
    Note over WAL: freeze the current WAL, and flush this WAL
    WAL->>Flusher: Flush WAL
    Flusher-->>WAL: ACK FlushWAL message
    WAL->>MemTable: Merge WAL to MemTable
```
In this proposal, “Commit to make it visible” effectively means applying the changes to the MemTable: whenever a change is applied to the MemTable, it is considered committed.
The changes to the read path are minimal. For reads with different durability levels, we can simply use different sequence numbers as watermarks to determine visibility.
On the write side, after the data is appended to the WAL, it is applied to the MemTable as soon as possible to make the change visible to readers.
The sequence diagram is like this:

```mermaid
sequenceDiagram
    participant WriteBatch
    participant WAL
    participant MemTable
    participant Flusher
    WriteBatch->>WAL: append write op to current WAL
    WAL->>WAL: Maybe trigger a flush
    Note over WAL: when it reaches flush.interval or flush.size
    WAL->>Flusher: Flush WAL
    Flusher-->>WAL: ack FlushWAL message, increment last wal flushed position
    WAL->>MemTable: apply write op to MemTable
    Note over MemTable: become visible to readers with ReadWaterMark::LastCommitted
    alt sync is Remote
        WriteBatch->>WAL: await last applied position
    end
    MemTable-->>WriteBatch: wake up waiters of last applied position
```
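As a rough illustration of the “wake up waiters of last applied position” step, here is a self-contained sketch (hypothetical names, not SlateDB's actual internals) of how the flusher could publish the last persisted sequence number and how a `SyncLevel::Remote` writer could wait on it, using a tokio watch channel:

```rust
use tokio::sync::watch;

/// Illustrative only: the flusher publishes the highest sequence number whose
/// WAL flush has been acked, and sync writers wait for their own seq to pass it.
struct PersistedWatermark {
    tx: watch::Sender<u64>,
    rx: watch::Receiver<u64>,
}

impl PersistedWatermark {
    fn new() -> Self {
        let (tx, rx) = watch::channel(0);
        Self { tx, rx }
    }

    /// Called by the WAL flusher after a successful flush to object storage.
    fn ack_flushed(&self, seq: u64) {
        // send_if_modified keeps the watermark monotonically increasing.
        self.tx.send_if_modified(|cur| {
            if seq > *cur {
                *cur = seq;
                true
            } else {
                false
            }
        });
    }

    /// Called by a writer that used SyncLevel::Remote: wait until its own
    /// sequence number has been durably persisted.
    async fn wait_persisted(&self, seq: u64) {
        let mut rx = self.rx.clone();
        // wait_for resolves as soon as the watermark reaches `seq`.
        let _ = rx.wait_for(|persisted| *persisted >= seq).await;
    }
}
```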
### Handling No WAL Write
It's not possible to have Sync Commit when the WAL is disabled, so we should not allow users to enable `sync` when the WAL is disabled.
Since `sync` cannot be set in this mode, all writes are considered Committed immediately and become visible to readers in the MemTable.
Users can still use `await_durable` to wait for the write to be durably persisted on the writer side.
### Handling WAL Write Failures
WAL write failures are more likely on object storage like S3 than on local disk.
We do not have to fail the write operation immediately when the WAL write fails. Instead, we can retry the write operation with a backoff.
However, a failure flushing the WAL to object storage may last for several minutes or even longer, and there may be no local disk to help buffer the WAL during the failure.
In this case, we can still buffer the WAL in memory as long as possible. When the failure is resolved, we can resume the flush operation.
However, memory is limited, and we can't buffer the WAL in memory forever. We already have a threshold setting, `max_unflushed_bytes`, for this. When this threshold is reached, we can't buffer the WAL any further and have to put the db into a read-only state.
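A rough sketch of the retry-with-backoff behavior described above (illustrative only; the flush closure and error type are placeholders):

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::sleep;

/// Illustrative only: retry a WAL flush with capped exponential backoff.
/// If all attempts fail, the caller treats it as a permanent failure and
/// transitions the database to a read-only state.
async fn flush_wal_with_backoff<F, Fut, E>(mut try_flush: F, max_attempts: u32) -> Result<(), E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<(), E>>,
{
    let mut delay = Duration::from_millis(100);
    let mut last_err = None;
    for _attempt in 0..max_attempts {
        match try_flush().await {
            Ok(()) => return Ok(()),
            Err(e) => {
                // Transient failure: remember it, back off, and retry.
                last_err = Some(e);
                sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(10));
            }
        }
    }
    // Exhausted retries: surface the last error so the caller can go read-only.
    Err(last_err.expect("max_attempts must be greater than zero"))
}
```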
When the WAL fails to be flushed to storage, a write operation with `SyncLevel::Remote` should fail with an `IOError` (or a concrete error type) and be considered not committed. This write operation should be invisible to all readers.
## Possible Code Changes
As described above, we need to tackle several details to implement this proposal, such as:

- Synchronizing the flushing of the WAL to storage.
- Notifying writers when the WAL is successfully flushed to storage.
- Retrying the flush operation with backoff when it fails, and finally marking the db as read-only if the failure is permanent.
- Flushing the memory-buffered WAL to storage in order.
- Supporting tiered WAL storage like local disk and S3.
- Handling the no-WAL mode.
Given the complexity of these changes, implementing everything in a single PR would be challenging. It's difficult to estimate the implementation effort required or identify all the code that will need to change in this RFC. A better approach would be to break this down into several smaller PRs, starting with some small refactoring to make the subsequent changes easier to implement.
One possible code change would be introducing a `WALManager` struct to centralize management of the WAL buffer and flush operations, helping to encapsulate the complexity in a single place.
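A rough skeleton of what such a `WALManager` might own (hypothetical fields and names, intended only to scope its responsibilities, not a final design):

```rust
use std::collections::VecDeque;

/// Illustrative only: a WALManager owning the in-memory WAL buffer and the
/// bookkeeping needed for flushing and synchronous commit.
struct WalManager {
    /// Encoded WAL entries appended but not yet flushed to object storage.
    unflushed: VecDeque<Vec<u8>>,
    /// Bytes currently buffered, compared against max_unflushed_bytes.
    unflushed_bytes: usize,
    max_unflushed_bytes: usize,
    /// Sequence numbers assigned on append / acked by the object store.
    next_seq: u64,
    last_flushed_seq: u64,
}

impl WalManager {
    /// Append an encoded entry and hand back its sequence number. Returns Err
    /// when the buffer is full and the database should transition to read-only.
    fn append(&mut self, entry: Vec<u8>) -> Result<u64, &'static str> {
        if self.unflushed_bytes + entry.len() > self.max_unflushed_bytes {
            return Err("WAL buffer full: transition to read-only");
        }
        self.unflushed_bytes += entry.len();
        self.unflushed.push_back(entry);
        let seq = self.next_seq;
        self.next_seq += 1;
        Ok(seq)
    }
}
```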
For inspiration, we can also reference how Badger structures its code. In Badger's implementation, the WAL buffer is embedded inside the memtable struct, as shown here:

```go
// memTable structure stores a skiplist and a corresponding WAL. Writes to memTable are written
// both to the WAL and the skiplist. On a crash, the WAL is replayed to bring the skiplist back to
// its pre-crash form.
type memTable struct {
	sl         *skl.Skiplist
	wal        *logFile
	maxVersion uint64
	opt        Options
	buf        *bytes.Buffer
}
```
It makes sense to put the manager of the WAL buffer inside the memtable struct, as the WAL is closely related to the memtable. It makes for good encapsulation: the complexities are hidden behind a simple `put()` / `get()` interface.
However, SlateDB is a different codebase, and it would be better to keep code structure changes minimal in each iteration, so several small PoC PRs make more sense. Introducing a `WALManager` might be a good first step to encapsulate the WAL buffer, flushing, and synchronous commit functionality. This would allow us to have more detailed discussions about code structure as we develop these PoCs.
## Updates
- 2025-02-20: added the comparison between `sync` and `await_durable`
- 2025-03-12: revised the API with `with_durability_filter` and `with_dirty`
[^1]: The discussion in the RFC for Handling Data Corruption may also be related to this topic.