Distributed System Fundamentals: Quorum, Consensus, and Distributed Transactions

Mohammad Tanvir Chowdhury

Introduction

A distributed system promises things a single server can't: fault tolerance, horizontal scalability, geographic reach. But the moment you spread your data and logic across multiple machines, you inherit a new category of problems.

Machines fail. Networks partition. Two nodes disagree about who's in charge. A transaction spans three services and one of them crashes mid-way. These aren't edge cases — they're the normal operating conditions of any system running at scale.

This post covers the core building blocks of distributed systems that every backend engineer needs to understand:

  • Quorum — how distributed systems reach agreement despite failures
  • Consensus — the problem of getting multiple nodes to agree on one thing
  • Raft and Paxos — the algorithms that actually solve consensus in production
  • Two-Phase Commit (2PC) — coordinating transactions across multiple systems
  • Three-Phase Commit (3PC) — the improvement that didn't quite make it into production
  • The SAGA pattern — the practical alternative to 2PC in microservices
  • Split-brain — what happens when a cluster disagrees about who's the leader
  • Network partitions — the failure mode that makes distributed systems hard
  • Fencing — how systems prevent split-brain from corrupting data

Section 1 — Quorum

🔹 Quorum

Simple Explanation

A quorum is the minimum number of nodes that must participate in an operation for it to be considered valid. In a distributed system, you can't always reach every node — some may be down, some may be slow. A quorum lets the system make progress as long as a majority of nodes are available, while guaranteeing that any two quorums overlap by at least one node.

That overlap is what makes quorums work. If every successful write reaches a quorum of nodes, and every read queries a quorum of nodes, at least one node in the read quorum must have seen the latest write.

The formula

For a cluster of N nodes, a quorum requires at least:

quorum = floor(N / 2) + 1

In plain terms: a majority.

3 nodes → quorum = 2  (can tolerate 1 failure)
5 nodes → quorum = 3  (can tolerate 2 failures)
7 nodes → quorum = 4  (can tolerate 3 failures)

A cluster of 2f+1 nodes can tolerate f simultaneous failures.
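The formula and the fault-tolerance relationship can be sketched in a few lines (helper names here are illustrative, not from any library):

```python
# A minimal sketch of the majority-quorum formula.
def quorum_size(n: int) -> int:
    """Minimum nodes for a majority quorum in an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """A cluster of 2f + 1 nodes tolerates f simultaneous failures."""
    return (n - 1) // 2

assert quorum_size(3) == 2 and tolerated_failures(3) == 1
assert quorum_size(5) == 3 and tolerated_failures(5) == 2
assert quorum_size(4) == 3 and tolerated_failures(4) == 1  # even cluster: no gain over 3
```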

Analogy

A jury requires 12 of 12 jurors to convict. That's a unanimous requirement — any single juror can block a decision. A quorum is more practical: require 7 of 12 (a majority) to agree. As long as more than half agree, the system moves forward. And any two groups of 7 share at least 2 members — so they can't make contradictory decisions without someone knowing about both.

Mini Diagram

5-node cluster, quorum = 3

Write operation:
→ Node 1: confirmed ✓
→ Node 2: confirmed ✓
→ Node 3: confirmed ✓
→ Node 4: down ✗
→ Node 5: down ✗

3/5 confirmed → quorum met → write succeeds ✓

Read operation (later):
→ Node 1: confirmed ✓
→ Node 2: confirmed ✓
→ Node 3: confirmed ✓

3/5 responded → at least one node has the latest write → consistent ✓

Why even-numbered clusters are a bad idea

A 4-node cluster has a quorum of 3 — it can only tolerate 1 failure, same as a 3-node cluster. But it costs more to run and adds no additional fault tolerance. Production clusters almost always use odd numbers (3, 5, 7). Adding a 4th node buys nothing. Adding a 5th gives you one more tolerated failure.

The W + R > N rule

The quorum consistency rule for reads and writes:

W + R > N

W = minimum nodes that must confirm a write
R = minimum nodes that must respond to a read
N = total nodes

If W + R > N, the read quorum and write quorum must overlap — at least one node that responded to the read also participated in the latest write. That node has the latest value.

If W + R ≤ N, reads can return stale data. This is acceptable in some systems (Cassandra at CONSISTENCY ONE) where you're trading consistency for speed.
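The rule is a one-line check; a quick sketch with a few common configurations (the function name is illustrative):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True when every read quorum must overlap the latest write quorum (W + R > N)."""
    return w + r > n

# Common configurations for a 5-node cluster:
assert is_strongly_consistent(n=5, w=3, r=3)      # quorum writes + quorum reads → consistent
assert not is_strongly_consistent(n=5, w=1, r=1)  # fast but stale reads possible
assert is_strongly_consistent(n=5, w=5, r=1)      # write-all, read-one → consistent but fragile writes
```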

Where quorums appear in real systems

Quorums are not just a database concept. They appear anywhere distributed systems need to make decisions:

  • MongoDB replica sets — writes with { w: "majority" } must be confirmed by a majority of replica set members
  • Cassandra — CONSISTENCY QUORUM requires a majority of replicas to respond
  • etcd / ZooKeeper / Consul — cluster operations require quorum of server nodes
  • Raft-based systems — log entries are committed once a quorum of nodes acknowledges them
  • PostgreSQL with Patroni — failover only happens when a quorum of nodes agrees the leader is gone

Interview Insight

"Why does quorum use majority and not all nodes?" Because requiring all nodes means any single failure blocks the system. A majority quorum tolerates failures while guaranteeing consistency — any two majority groups must overlap by at least one member. That overlap prevents contradictory decisions.

"Why not just one node?" Because one node can't guarantee you've seen the latest write. Majority is the minimum that guarantees overlap.

Section 2 — Consensus

🔹 Consensus

Simple Explanation

Consensus is the problem of getting a group of nodes to agree on a single value or decision — even when some nodes fail or messages are lost. Once consensus is reached, the decision is final and every participating node knows the same answer.

This sounds simple. It's not. It's one of the hardest problems in distributed computing, and entire research careers have been built around it.

Why consensus is hard

In a single-machine system, agreement is trivial. In a distributed system, three things go wrong simultaneously:

Nodes fail: a node may crash mid-operation, leaving others uncertain whether it completed its part.

Messages are lost or delayed: a message from Node A to Node B might arrive late, arrive twice, or not arrive at all. You can't distinguish a slow node from a dead one.

No shared clock: nodes don't have a single source of time. "Event A happened before Event B" is surprisingly hard to establish across machines with imprecise clocks.

The FLP impossibility result (1985) formally proved that in a purely asynchronous system with even one possible node failure, consensus cannot be guaranteed. Real consensus algorithms work around this by making timing assumptions — if a node doesn't respond within a certain window, assume it's dead and proceed.

What consensus is used for

  • Leader election — all nodes agree on which node is the current leader
  • Log replication — all nodes agree on the same ordered sequence of operations
  • Distributed transactions — all nodes agree on whether a transaction committed or rolled back
  • Configuration changes — all nodes agree on current cluster membership

Mini Diagram

Problem: 5 nodes need to agree on a new leader

Node A → proposes Node 3
Node B → proposes Node 3
Node C → proposes Node 3
Node D → (crashed, no response)
Node E → proposes Node 3

4/5 agree → quorum reached → consensus: Leader = Node 3
All working nodes recognise Node 3 as leader

🔹 Raft

Simple Explanation

Raft is a consensus algorithm designed to be understandable — its authors explicitly prioritised clarity over theoretical minimality. It breaks consensus into three distinct problems: leader election, log replication, and safety. Most modern distributed systems that need consensus use Raft or a Raft-derived variant.

Leader election in Raft

Every node is in one of three states: follower, candidate, or leader. All nodes start as followers.

The leader sends heartbeat messages to all followers at regular intervals. If a follower doesn't receive a heartbeat within a random election timeout (typically 150–300ms), it assumes the leader has failed, increments its term number, transitions to candidate, and starts an election.

Normal operation:
Leader → heartbeat → Follower 1
Leader → heartbeat → Follower 2
Leader → heartbeat → Follower 3
Leader fails → no heartbeats for 250ms
Follower 1 timeout expires → becomes Candidate
Candidate: votes for itself, sends RequestVote to others
Follower 2: votes yes ✓
Follower 3: votes yes ✓
Candidate receives majority (3 votes of the 4-node cluster, counting its own) → becomes Leader ✓

The random election timeout prevents multiple nodes from becoming candidates simultaneously and splitting the vote. If a split vote does occur, nodes wait a new random timeout and try again.
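The timeout mechanics above can be sketched as follows — a simplified model, not the full Raft state machine, with illustrative names throughout:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # typical Raft range

def new_election_timeout() -> float:
    """Each follower draws a fresh random timeout, so two followers
    rarely expire at the same instant and split the vote."""
    lo, hi = ELECTION_TIMEOUT_MS
    return random.uniform(lo, hi)

class Follower:
    def __init__(self):
        self.deadline = new_election_timeout()
        self.elapsed = 0.0

    def on_heartbeat(self):
        # A heartbeat from the leader resets the countdown.
        self.elapsed = 0.0
        self.deadline = new_election_timeout()

    def tick(self, ms: float) -> bool:
        # Returns True when the node should transition to candidate.
        self.elapsed += ms
        return self.elapsed >= self.deadline
```

As long as heartbeats arrive faster than the minimum timeout, `tick` never fires; the moment they stop, some follower's deadline expires first and it starts the election.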

Log replication in Raft

Once a leader is elected, all writes go through it. The leader appends the write to its log and sends it to all followers. Once a quorum of followers acknowledges the write, the leader marks it committed and applies it to the state machine.

Client → write "SET balance = 500" → Leader
Leader → appends to log [term=2, index=5, cmd="SET balance=500"]
Leader → AppendEntries RPC to all followers
Follower 1 → confirms ✓
Follower 2 → confirms ✓
Follower 3 → (slow, not yet confirmed)

Quorum met (leader + 2 followers = 3/4) → entry committed ✓
Leader → responds to client: success ✓
Follower 3 → catches up in next heartbeat cycle

The key safety property: a log entry is only committed once a quorum has it. A newly elected leader always has the most up-to-date log — a node with a stale log can't win an election, because nodes only vote for candidates whose log is at least as current as their own.
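Both checks in this section — commit-on-quorum and the election restriction — reduce to simple comparisons. A sketch with illustrative function names:

```python
def entry_committed(acks: int, cluster_size: int) -> bool:
    """A log entry is committed once a majority holds it
    (the leader's own copy counts as one ack)."""
    return acks >= cluster_size // 2 + 1

def grants_vote(candidate_last: tuple, voter_last: tuple) -> bool:
    """Raft's election restriction: vote only for a candidate whose log is
    at least as current — compare (last term, last index) lexicographically."""
    return candidate_last >= voter_last

assert entry_committed(acks=3, cluster_size=4)        # leader + 2 of 3 followers
assert not entry_committed(acks=2, cluster_size=5)    # minority → not committed
assert grants_vote((2, 5), (2, 4))      # same term, longer log → vote granted
assert not grants_vote((1, 9), (2, 3))  # older term loses despite a longer log
```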

Where Raft is used

  • etcd — the key-value store Kubernetes uses for all cluster state
  • CockroachDB — every range (shard) is replicated using Raft
  • TiKV — distributed key-value store backing TiDB
  • Consul — service discovery and configuration
  • MongoDB — replica set elections use a Raft-inspired protocol
  • Kafka (KRaft mode) — replaced ZooKeeper with built-in Raft for metadata management

🔹 Paxos

Simple Explanation

Paxos is the original consensus algorithm, proposed by Leslie Lamport in 1989. Raft was explicitly designed as a more understandable alternative. Paxos is theoretically elegant but notoriously hard to implement correctly — Lamport himself noted that engineers consistently found it difficult to reason about.

The basic structure

Paxos has three roles: proposers (initiate consensus), acceptors (vote on proposals), and learners (learn the outcome). A proposal goes through two phases:

Phase 1 (Prepare):
Proposer → sends Prepare(n) to majority of acceptors
Acceptors → respond with promise not to accept proposals < n

Phase 2 (Accept):
Proposer → sends Accept(n, value) to majority
Acceptors → accept if they haven't promised to a higher-numbered proposal

If majority accepts → value is chosen
Learners → learn the chosen value

Most systems use Multi-Paxos — a variant that elects a stable leader to issue proposals, avoiding the overhead of Phase 1 for every operation. This is essentially how Raft works, which is why Raft is often described as a more understandable version of Multi-Paxos.
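The acceptor side of the two phases can be sketched as below — a single-decree toy model under simplified assumptions (no persistence, no learners), not a production Paxos:

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor."""
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (n, value) of the highest accepted proposal

    def prepare(self, n: int):
        # Phase 1: promise to ignore any proposal numbered below n,
        # and report what (if anything) was already accepted.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n: int, value):
        # Phase 2: accept unless a higher-numbered proposal was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"
```

A proposer that collects promises from a majority, then acceptances from a majority, has a chosen value — any competing proposer with a lower number is rejected by the overlapping acceptors.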

Where Paxos is used

  • Google Chubby — Google's distributed lock service
  • Google Spanner — uses Paxos for replication within shard groups
  • Apache ZooKeeper — uses ZAB (ZooKeeper Atomic Broadcast), a Paxos variant

Raft vs Paxos

Both solve the same problem with equivalent guarantees. Raft is easier to understand and implement. Paxos is more flexible and supports leaderless variants like EPaxos. For new systems, Raft is the default choice.

Section 3 — Distributed Transactions

When a transaction spans multiple databases or services, you can no longer rely on a single database's ACID guarantees. You need a protocol to coordinate commits across participants. This is where 2PC, 3PC, and SAGA come in.

🔹 Two-Phase Commit (2PC)

Simple Explanation

Two-Phase Commit is a protocol for atomically committing a transaction across multiple systems. A coordinator manages the process. In Phase 1 (Prepare), the coordinator asks every participant: "can you commit?" Each participant locks its resources and responds yes or no. In Phase 2 (Commit/Rollback), the coordinator makes the final decision: if all said yes, commit. If any said no, rollback everywhere.

All participants commit or all rollback. No partial commits.

Analogy

A wedding coordinator calls every vendor the morning of the event: "Are you ready?" Caterer: yes. Photographer: yes. Band: yes. Florist: no — their van broke down. The coordinator calls everything off. Everyone stands down. All or nothing — if any single vendor can't deliver, the whole event gets cancelled rather than proceeding incomplete.

Mini Diagram

Coordinator → "Prepare?" → Payment Service
                         → Inventory Service
                         → Shipping Service

Responses:
Payment Service   → "Yes, prepared" ✓ (locks funds)
Inventory Service → "Yes, prepared" ✓ (locks stock)
Shipping Service  → "Yes, prepared" ✓ (locks slot)

All yes → Coordinator → "Commit" → all three commit ✓

Failure path (any no):
Shipping Service  → "No"
→ Coordinator → "Rollback" → all three rollback, all locks released ✓

PostgreSQL distributed transactions

PostgreSQL supports 2PC natively through PREPARE TRANSACTION. This is how distributed transaction managers coordinate commits across multiple PostgreSQL instances:

-- Connection 1 (Payment DB):
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
PREPARE TRANSACTION 'txn_order_12345';
-- Row is locked, waiting for coordinator decision

-- Connection 2 (Inventory DB):
BEGIN;
UPDATE stock SET reserved = reserved + 1 WHERE product_id = 99;
PREPARE TRANSACTION 'txn_order_12345';

-- Coordinator: all prepared, commit both:
COMMIT PREPARED 'txn_order_12345';  -- run on both connections

-- Or if anything failed:
ROLLBACK PREPARED 'txn_order_12345';  -- run on both connections

The blocking problem — 2PC's fatal flaw

2PC has one critical weakness: if the coordinator crashes after sending "Prepare" but before sending "Commit" or "Rollback," participants are stuck. They've prepared their local transactions and are holding locks. They can't commit on their own. They can't rollback without risking inconsistency.

They block until the coordinator recovers.

Coordinator → "Prepare?" to all → all respond "Yes"
Coordinator crashes ← HERE

All participants:
  Payment Service:   holding locks, waiting ❌
  Inventory Service: holding locks, waiting ❌
  Shipping Service:  holding locks, waiting ❌

Cannot proceed until coordinator restarts.
If coordinator disk is corrupted → stuck indefinitely.

This is the blocking problem — a failed coordinator blocks the entire transaction and holds resource locks across all participants. In microservices that make external API calls or run long operations, this is unacceptable.

✅ Use 2PC when:

  • All participants are databases that support prepared transactions
  • Transaction duration is short (milliseconds, not seconds)
  • The coordinator itself is highly available
  • You need strict atomicity between a small number of tightly coupled systems

❌ Avoid 2PC when:

  • Participants include external services or APIs that don't support prepare/commit
  • Transaction duration is long or involves user interaction
  • You're building microservices at scale — the blocking and lock overhead kills throughput

🔹 Three-Phase Commit (3PC)

Simple Explanation

Three-Phase Commit adds an intermediate phase between prepare and commit — the pre-commit phase — to address 2PC's blocking problem. The extra step gives participants enough information to make progress without the coordinator, removing the blocking condition.

The three phases

Phase 1 (CanCommit):
Coordinator → "Can you commit?" → all participants
Participants → "Yes" or "No"

Phase 2 (PreCommit):
If all yes:
Coordinator → "Pre-commit" → all participants
Participants → acknowledge pre-commit

Phase 3 (DoCommit):
Coordinator → "Commit" → all participants
All commit ✓

The key insight: once a participant receives and acknowledges pre-commit, it knows all other participants were also ready. If the coordinator crashes now, a participant can safely commit on its own — it knows no one said no in Phase 1.

Why 3PC isn't used in practice

3PC solves the coordinator crash problem, but introduces a network partition problem. If the network splits during Phase 2, some participants received pre-commit and some didn't. Recovery algorithms on each side may make different decisions — one side commits, the other rolls back — the exact inconsistency 3PC was supposed to prevent.

Modern systems solve the coordinator crash problem differently:

  • Make the coordinator itself highly available through Raft/Paxos (so it rarely crashes)
  • Use the SAGA pattern, which avoids distributed locking entirely

3PC is important for interviews and distributed systems theory, but you won't find it in production.

🔹 The SAGA Pattern

Simple Explanation

SAGA is the practical answer to distributed transactions in microservices. Instead of locking resources across services and coordinating a global commit, SAGA breaks the transaction into a sequence of independent local transactions — one per service. Each step commits locally and publishes an event. If any step fails, the pattern runs compensating transactions — actions that undo what the completed steps did.

No coordinator. No cross-service locks. No blocking.

Analogy

Booking a package holiday. The agent books flights, then hotel, then car rental — each independently. If car rental fails, the agent cancels the hotel booking and the flights. Each cancellation is a compensating action. Nobody held a global lock on all three bookings simultaneously. Just sequential steps with defined rollback paths.

Mini Diagram — e-commerce order

Step 1: Order Service → creates order (status: pending) → publishes OrderCreated ✓
Step 2: Payment Service → charges card → publishes PaymentProcessed ✓
Step 3: Inventory Service → reserves stock → publishes StockReserved ✓
Step 4: Shipping Service → schedules delivery → Order complete ✓

Failure at step 3 (out of stock):
Step 3: Inventory Service → publishes StockUnavailable ✗

Compensating transactions (run in reverse):
← Payment Service: refunds charge → publishes PaymentRefunded ✓
← Order Service: cancels order ✓

Two SAGA implementations

Choreography-based SAGA: no central coordinator. Services communicate via events. Each service listens for events and publishes its own in response.

OrderService publishes "OrderCreated"
  → PaymentService listens, charges card, publishes "PaymentProcessed"
  → InventoryService listens, reserves stock, publishes "StockReserved"
  → ShippingService listens, schedules delivery

Failure path:
InventoryService publishes "StockUnavailable"
  → PaymentService listens, refunds card, publishes "PaymentRefunded"
  → OrderService listens, cancels order

No single point of failure. But tracking the overall transaction state is hard — it's distributed across event logs.

Orchestration-based SAGA: a central orchestrator explicitly calls each service in sequence and handles failures.

SagaOrchestrator:
  1. Call OrderService.createOrder()    → success ✓
  2. Call PaymentService.chargeCard()   → success ✓
  3. Call InventoryService.reserve()    → failure ✗

Compensate (reverse order):
  4. Call PaymentService.refund()       → success ✓
  5. Call OrderService.cancelOrder()    → success ✓

More complex to build, but easy to monitor — the orchestrator knows the exact state of every step.
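The orchestrator's core loop — run each step, and on failure compensate completed steps in reverse — fits in a few lines. A sketch where each step is an (action, compensation) pair of callables; names are illustrative:

```python
def run_saga(steps) -> str:
    """Orchestration-based SAGA sketch. `steps` is a list of
    (action, compensation) callables; each action is a local transaction."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            # A step failed: undo completed steps in reverse order.
            for undo in reversed(done):
                undo()
            return "compensated"
    return "completed"
```

A real orchestrator also persists the saga's state after each step, so it can resume or compensate after its own crash — the sketch keeps state only in memory.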

SAGA vs 2PC — the core tradeoff

2PC locks resources across all participants until the global transaction completes. If a step involves a 30-second API call, all locks are held for 30 seconds — every other transaction touching those rows waits. This is why 2PC breaks down in microservices.

SAGA doesn't lock anything across services. Each step commits locally and moves on. The tradeoff: between steps, the system is in an intermediate state that other transactions can see. This is eventual consistency — the system will reach a consistent state, but not atomically.

For most microservice architectures, this is acceptable. For strict financial ledgers where intermediate states are never acceptable, 2PC or a single ACID database is the right tool.

✅ Use SAGA when:

  • Your transaction spans multiple microservices with separate databases
  • Transaction duration is long or involves external service calls
  • You need scalability and can tolerate eventual consistency
  • Every step has a meaningful compensating transaction

❌ SAGA is harder when:

  • A step has no meaningful compensation (you can't un-send an email or un-fire a notification)
  • Strict atomicity is required — intermediate states can't be visible to other transactions
  • The sequence of steps is deeply interdependent and hard to reverse safely

Section 4 — Split-Brain and Network Partitions

🔹 Network Partitions

Simple Explanation

A network partition is when a network failure splits a cluster into two or more isolated groups. Nodes within each group can communicate, but nodes in different groups can't reach each other at all. Crucially, the nodes haven't crashed — they're running fine, just cut off from each other.

This makes partitions particularly dangerous: each isolated group thinks the other group is dead, when actually both are fully operational.

The CAP choice under partition

When a partition happens, you're forced to choose:

CP (Consistency + Partition Tolerance): nodes that can't reach a quorum stop accepting writes. They sacrifice availability to prevent producing conflicting data.

AP (Availability + Partition Tolerance): every node keeps responding. They sacrifice consistency — different partitions may accept conflicting writes that need reconciliation later.

There's no universally right answer. A payment system sacrifices availability rather than risk inconsistent financial data. A social media like counter sacrifices consistency to stay available.

🔹 Split-Brain

Simple Explanation

Split-brain is the specific failure scenario where a network partition causes two nodes to both believe they're the current leader simultaneously. Both start accepting writes. When the partition heals, you have two diverged histories with conflicting data.

It's one of the most dangerous failure modes in distributed databases — it can cause data corruption, double-writes, and inconsistencies that are hard to reconcile automatically.

Analogy

A company with offices in Dhaka and London loses its intercontinental communication link. Both offices assume the other is unreachable. Both promote their own local director to "Global CEO." Dhaka's CEO approves one acquisition. London's CEO approves a conflicting one. When communication is restored, both decisions look legitimate, but they contradict each other.

Mini Diagram

5-node cluster: [N1, N2, N3] ← partition → [N4, N5]

Left side [N1, N2, N3]:
→ Can't reach N4, N5 → assumes they're dead
→ Has quorum (3/5) → elects N1 as leader ✓ (correct)
→ N1 accepts writes

Right side [N4, N5]:
→ Can't reach N1, N2, N3 → assumes they're dead
→ Does NOT have quorum (2/5) → cannot elect leader ✓
→ Stops accepting writes (correct behaviour)

Partition heals:
→ N4, N5 rejoin as followers under N1
→ No split-brain — quorum prevented it ✓

Notice: a majority quorum prevents split-brain on its own. The minority partition can't reach quorum, so it can't elect a leader. Only the majority partition makes progress.

When quorum isn't enough — the 2-node problem

A 2-node cluster is the worst case:

[N1] ← partition → [N2]

N1: "Can't reach N2 → I must be the leader"
N2: "Can't reach N1 → I must be the leader"

Both have 1/2 nodes → no clear majority → both may elect themselves
Result: split-brain ❌

This is why production clusters never use exactly 2 nodes. 3 is the minimum. For systems that want a lightweight third vote without a full third data node, an arbiter (or witness) node participates in elections but stores no data.

🔹 Fencing

Simple Explanation

Quorum prevents split-brain most of the time, but there's a remaining edge case: what if a node believes it's the leader, then experiences a long GC pause, slow disk, or temporary network blip? Another node gets elected while it's unresponsive. When the first node comes back, it still thinks it's the leader and starts writing — even though a new leader has already been established and moved data forward.

Fencing is the mechanism that ensures a deposed leader cannot write, even if it hasn't realised it's been replaced.

STONITH (Shoot The Other Node In The Head)

The most direct approach: when a new leader is elected, the cluster issues a command to physically power off or reset the old leader before proceeding. A node you've killed cannot write.

STONITH is used in high-availability PostgreSQL clusters managed by Pacemaker + Corosync. It sounds extreme, but it's effective — physical fencing is an absolute guarantee.

Epoch-based fencing

Most modern systems use epoch numbers rather than physical fencing. Every leader is associated with an epoch number (called "term" in Raft). When a new leader is elected, it gets a higher epoch. Every write to storage includes the writer's current epoch.

The storage layer rejects writes from any node whose epoch is lower than the current epoch. A deposed leader attempting to write gets rejected — even if it still thinks it's in charge.

Epoch 1: Node 1 is leader
Network partition → Node 3 is elected → Epoch 2

Node 1 (still thinks it's leader):
→ Attempts to write with epoch=1
→ Storage: "current epoch is 2, rejected" ✗

Node 3 (actual leader):
→ Writes with epoch=2
→ Storage: "epoch matches, accepted" ✓

Raft implements this through term numbers. Any node that receives an RPC with a higher term immediately steps down to follower and updates its term. Old leaders can't persist in their incorrect belief — the term number forces them to recognise a new election happened.
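The storage-side check is a single comparison. A sketch of an epoch-fenced store, with illustrative names:

```python
class FencedStore:
    """Storage layer that rejects writes from stale epochs (Raft-style terms)."""
    def __init__(self):
        self.epoch = 0   # highest epoch seen so far
        self.data = {}

    def write(self, epoch: int, key, value) -> bool:
        if epoch < self.epoch:
            return False          # deposed leader → write rejected
        self.epoch = epoch        # advance to the newest epoch seen
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(2, "balance", 500)       # new leader writes at epoch 2
assert not store.write(1, "balance", 900)   # old leader at epoch 1 → fenced
```

The rejection itself is also a signal: a leader whose writes start failing with a stale epoch learns that an election happened and should step down.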

PostgreSQL fencing with Patroni

Patroni prevents split-brain by storing the current leader's identity in a distributed consensus store (etcd or Consul). A node can only accept writes if it currently holds the leader key in etcd. If a node loses contact with etcd, it immediately self-fences — stepping down and rejecting writes, even without being told to by another node.

MongoDB fencing

MongoDB uses election IDs (term numbers) for fencing. A primary that gets partitioned and then reconnects sees a higher election ID from the new primary and immediately steps down. Additionally, { w: "majority" } write concern ensures writes must be confirmed by a quorum — a partitioned primary that can only reach a minority cannot satisfy majority write concern, so its writes fail automatically.

Section 5 — Putting It All Together

Quorum
└─ Foundation for everything else
└─ Majority ensures any two decisions share an overlapping node
└─ Prevents minority partitions from making authoritative decisions
└─ W + R > N guarantees reads see latest writes

Consensus (Raft / Paxos)
└─ Built on top of quorum
└─ Raft: leader election + log replication + safety
└─ Every committed entry survives any future leader election
└─ Used in etcd, CockroachDB, Kafka (KRaft), MongoDB

Distributed Transactions
├─ 2PC → atomic across multiple systems, but blocks on coordinator failure
│         good for: short transactions between databases
│         bad for:  microservices, long-running operations
├─ 3PC → fixes blocking, introduces partition problems → rarely deployed
└─ SAGA → async, compensating transactions, eventual consistency
          good for: microservices, long transactions, external services
          bad for:  operations with no meaningful compensation

Split-brain prevention
├─ Majority quorum → minority partitions can't elect a leader
├─ Consensus algorithms (Raft) → term numbers prevent stale leaders
└─ Fencing → physical (STONITH) or epoch-based → last line of defence

How These Concepts Appear in Real Systems

etcd

The distributed key-value store that Kubernetes uses for all cluster state. Uses Raft for consensus. Requires 3 or 5 nodes in production. A quorum must be available for reads and writes. Patroni uses etcd as its distributed coordinator for PostgreSQL failover — the leader key in etcd is the source of truth for who the current PostgreSQL primary is.

ZooKeeper

Apache's coordination service using ZAB (ZooKeeper Atomic Broadcast), a Paxos variant. Provides distributed locks, leader election, and configuration storage. Used by older Kafka versions, HBase, and much of the Hadoop ecosystem.

MongoDB replica sets

Raft-inspired elections with term-based fencing. { w: "majority" } write concern for quorum-based durability. Elections require a quorum of voting members — 3-node replica sets are the minimum production configuration. Split-brain is prevented by term numbers; a partitioned primary steps down when it reconnects and sees a higher term from the elected replacement.

CockroachDB

Every data range is replicated using Raft. Distributed SQL transactions use a combination of 2PC and Raft — the commit record is written to the Raft log rather than depending on a separate coordinator process. This sidesteps the blocking problem of traditional 2PC because the "coordinator" decision is itself replicated via consensus.

Cassandra

Quorum-based reads and writes with tunable consistency. No traditional consensus protocol for writes — uses gossip for cluster membership and Paxos lightweight transactions for conditional writes. Cassandra explicitly accepts eventual consistency in most configurations; split-brain resistance depends on correct quorum configuration by the operator.

Kafka (KRaft mode)

Replaced ZooKeeper with a built-in Raft implementation for managing cluster metadata. The controller quorum uses Raft for leader election and log replication. Eliminates the operational burden of running a separate ZooKeeper cluster alongside Kafka.

Microservices (SAGA)

SAGA is the standard pattern for distributed transactions across microservices. Orchestration-based SAGA is common in e-commerce (order → payment → inventory → shipping), where a central saga orchestrator coordinates steps and runs compensating transactions on failure. Choreography-based SAGA is common in event-driven architectures where services are more loosely coupled.

Conclusion

Distributed systems are hard because the failure modes are subtle, intermittent, and often only appear under specific combinations of load and bad luck. The concepts in this post are the vocabulary for understanding what's going wrong and why.

Here's what to take away:

  • Quorum is majority-based agreement. W + R > N guarantees reads see the latest write. Always use odd-numbered clusters.
  • Consensus is the hardest problem in distributed systems. Raft solves it for most production systems with an understandable, implementable algorithm.
  • 2PC provides atomic commits across systems but blocks on coordinator failure. Right for short transactions between databases; wrong for microservices.
  • 3PC fixes blocking but introduces partition problems. Theoretically interesting, rarely deployed.
  • SAGA is the practical alternative for microservices — async, compensating transactions, eventual consistency. Design your compensating transactions carefully.
  • Network partitions force a choice between consistency and availability. Neither is wrong — choose based on what your system can't afford to get wrong.
  • Split-brain happens when quorum isn't enforced. A minority partition that can't reach quorum must stop accepting writes, not promote itself to leader.
  • Fencing is the safety net after quorum. Epoch numbers prevent deposed leaders from writing even if they don't know they've been replaced.

The engineers who build reliable distributed systems aren't the ones who avoid these problems. They're the ones who understand them deeply enough to know exactly which failure mode they're dealing with at 3am.