Distributed Protocol Primitives

Every distributed protocol is built from seven recurring message-exchange primitives. This catalog identifies them across MESI, Paxos, Raft, 2PC, and others — showing how the same patterns compose under different names.

Every distributed protocol is built from a small number of recurring message-exchange patterns. These patterns appear in cache coherence, consensus, replication, locking, file systems, and transaction processing. Each protocol names them differently. An engineer who learns MESI cache coherence and an engineer who learns Paxos encounter the same building blocks under different names.

The distributed systems (DS) literature organizes by problem (consensus, broadcast, mutual exclusion) and by system model (synchronous, asynchronous, Byzantine). The layer between the two, the way messages actually flow within a protocol, has not been cataloged separately. This document does so.

The primitives are ordered so that each one builds only on primitives the reader already knows.


1. Request-Grant

One process asks another for something. The receiver decides and responds with a grant or denial.

1a. Single granter. One designated process decides. The simplest form.

1b. Quorum grant. The request goes to many processes. A majority must grant for the request to succeed.

On denial, the requester’s behavior depends on the protocol. It may simply wait (MESI: the directory queues the request until the resource is available). It may retry with a higher token (Paxos: the proposer picks a higher ballot and tries again; see monotonic ordering token, primitive 4). It may give up (database: the transaction aborts and reports failure to the application).

| Protocol | Variant | Request | Granter | What is granted |
|---|---|---|---|---|
| MESI | 1a | BusRd / BusRdX | Directory | Read copy / write permission |
| Paxos | 1b | Prepare(n) | Acceptors (majority) | Promise not to accept lower ballots |
| Mutex | 1a | lock() | OS scheduler | Access to critical section |
| Raft | 1b | RequestVote | Followers (majority) | Vote for leadership |
| NFS v4 | 1a | OPEN | Server | File delegation |
| Database | 1a | SELECT FOR UPDATE | Lock manager | Row lock |
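The quorum-grant variant (1b) can be sketched in a few lines. This is a minimal in-process model, not any real protocol implementation: `Granter` and `request_quorum_grant` are illustrative names, and the token comparison folds in the monotonic ordering token of primitive 4 (as Paxos Prepare does).

```python
# Sketch of primitive 1b (quorum grant): a request succeeds only if a
# majority of granters say yes. Granters are plain objects here; in a real
# protocol each would be a separate process reached over the network.

class Granter:
    def __init__(self):
        self.highest_seen = -1  # highest token this granter has promised

    def handle_request(self, token):
        # Grant only a request carrying a token higher than any seen so far
        # (the "higher number supersedes" rule from primitive 4).
        if token > self.highest_seen:
            self.highest_seen = token
            return True
        return False

def request_quorum_grant(granters, token):
    """Return True if a majority of granters grant the request."""
    grants = sum(g.handle_request(token) for g in granters)
    return grants > len(granters) // 2
```

A second request with the same token is denied everywhere, which is exactly the "retry with a higher token" behavior described above.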

2. Broadcast-and-Collect

One process sends the same request to many processes, then waits for responses. This is N × request-grant (1) with a barrier that blocks until enough responses arrive.

2a. Unanimity. All must respond. One slow or crashed process blocks everyone. Same vulnerability as revoke-and-confirm (3a): the broadcaster is stuck until all respond. Protocols that use this variant either assume no crashes (MESI, hardware) or accept being a blocking protocol (2PC).

2b. Quorum. A majority must respond. Resilient: the broadcaster can proceed even if a minority of processes crash or are slow. If fewer than a majority respond, the broadcaster retries or aborts, but it is never permanently stuck by a single crash. (Uses the “quorum intersection” trick: any two quorums overlap, so information accepted by one quorum cannot be missed by the next.)

| Protocol | Variant | What is broadcast | What is collected |
|---|---|---|---|
| MESI invalidation | 2a | “Discard your copy” | Acknowledgments |
| 2PC Phase 1 | 2a | “Can you commit?” | Votes (yes/no) |
| Paxos Phase 1 | 2b | Prepare(n) | Promises + accepted values |
| Paxos Phase 2 | 2b | Accept(n, v) | Accepted confirmations |
| Raft AppendEntries | 2b | Log entries | Success acknowledgments |
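The "N parallel requests plus a barrier" shape can be sketched with threads standing in for remote processes. This is an illustrative model, assuming targets are local callables; a real implementation would use network RPCs and timeouts.

```python
# Sketch of primitive 2 (broadcast-and-collect): send the same request to
# every target, then block at a barrier until `threshold` responses arrive.
from concurrent.futures import ThreadPoolExecutor, as_completed

def broadcast_and_collect(targets, request, threshold):
    """Broadcast `request` to all targets; return the first `threshold`
    responses, or None if too few targets respond successfully."""
    responses = []
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = [pool.submit(t, request) for t in targets]
        for fut in as_completed(futures):
            try:
                responses.append(fut.result())
            except Exception:
                continue  # a crashed or failing target simply never counts
            if len(responses) >= threshold:  # the barrier: enough responses
                return responses
    return None  # fewer than `threshold` targets responded

# Variant 2a (unanimity) is threshold == len(targets);
# variant 2b (quorum) is threshold == len(targets) // 2 + 1.
```

Note how the two variants differ only in the threshold: with unanimity, one crashed target forces the `None` path; with a quorum, a minority of crashes is absorbed silently.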

3. Revoke-and-Confirm

One process tells others to give something up, then waits for confirmations before proceeding. Same message shape as broadcast-and-collect (2), but the semantics differ: broadcast-and-collect gathers information or agreement, revoke-and-confirm takes something away.

3a. Unanimity. Wait for ALL confirmations. One missing confirmation blocks the entire operation. If a process crashes or is unreachable, the revoker is stuck. Protocols handle this differently: MESI assumes the hardware does not crash (the directory waits indefinitely). 2PC has the same vulnerability (a crashed participant blocks the entire transaction, which is why 2PC is called a “blocking” protocol). Systems that cannot afford to block use quorum revocation (3b) or timeouts that treat absence as confirmation. (Uses the “scope the barrier” trick: the barrier includes only the processes that hold the resource, not all processes in the system.)

3b. Quorum. Wait for a MAJORITY. More resilient but requires reasoning about what the non-responding minority might do.

| Protocol | Variant | Revocation message | Who confirms | What is revoked |
|---|---|---|---|---|
| MESI | 3a | Invalidation | All readers | Read copies of a cache line |
| 2PC | 3a | Prepare | All participants | Right to abort unilaterally |
| Paxos Phase 1 | 3b | Prepare(n) | Majority of acceptors | Ability of lower-ballot proposers to succeed |
| NFS v4 | 3a | CB_RECALL | Delegation holder | File delegation |
| Ceph | 3a | Capability revocation | Clients with caps | Read/write capabilities |

4. Monotonic Ordering Token

A strictly increasing number attached to authority. Receivers reject requests with a number lower than the highest they’ve seen.

The motivating scenario: authority transfers from process A to process B. Process A does not know it lost authority yet (lazy discovery, or network delay, or a long pause). Process A sends a request. Process B also sends a request. Both arrive at a recipient. Without a token, the recipient cannot tell which is from the current authority and which is stale. With a monotonic token, the recipient compares: higher number is current, lower number is stale and rejected.

4a. Centrally generated. One process issues the numbers. Simple, but the generator is a single point of failure.

4b. Locally generated. Each process picks its own number, high enough to supersede all previously seen. No coordination needed to generate, but conflicts are resolved by comparison. (This is the basis of the “higher number supersedes” trick.)

4c. Per-process logical. Every process maintains its own counter, merged on communication (take the maximum, then increment). Captures causal ordering without a central authority. (Uses “piggyback state on existing messages”: the counter travels on messages already being sent.)

| Protocol | Variant | Name of the token | Who generates it | Who checks it |
|---|---|---|---|---|
| ZooKeeper | 4a | zxid / znode version | Leader | Followers |
| Kleppmann’s fencing | 4a | Fencing token | Lock service | Storage service |
| Paxos | 4b | Ballot number | Proposer | Acceptor |
| Raft | 4b | Term | Candidate | All nodes |
| Lamport clocks | 4c | Logical timestamp | Every process | Every process |
| Database | 4a | Transaction ID / LSN | Transaction manager | Recovery system |
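Variant 4c can be sketched as a Lamport clock. This is a minimal in-process model; in a real system the timestamp travels piggybacked on protocol messages rather than through direct method calls.

```python
# Sketch of variant 4c (per-process logical counter), i.e. a Lamport clock:
# each process increments its own counter on local events, attaches it to
# every message, and on receipt takes the maximum of the two, then increments.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event: advance the counter."""
        self.time += 1
        return self.time

    def send(self):
        """Tick, then return the timestamp to piggyback on the message."""
        return self.tick()

    def receive(self, msg_time):
        """Merge on receipt: take the maximum, then increment."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```

If A sends at logical time 3 and B, sitting at time 1, receives that message, B jumps to 4: the send is now ordered before the receive at every process that hears about either event.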

5. Authority Transfer

Authority over a resource moves from one process to another. The hard problem: at no moment should two processes both believe they hold authority, and at no moment should authority be absent (unless the protocol tolerates a gap). The transfer must appear atomic to the rest of the system.

5a. Coordinated. A central process manages the transfer. A requesting process sends a request-grant (1) to the coordinator. The coordinator sends a surrender message to the old holder, waits for confirmation (which may involve flush-before-handoff (6) if the holder has dirty state), then grants authority to the requester. The gap is zero because the coordinator sequences both steps and does not grant until surrender is confirmed.

5b. Competitive, unique tokens. No coordinator. Each process generates a globally unique monotonic ordering token (4) using the “local value + global identity” trick (e.g., (round, process_id) with lexicographic comparison), so no two processes ever generate the same token. A process that wants authority picks the next token higher than any it has seen and broadcasts its claim (broadcast-and-collect, 2b). Each recipient compares the incoming token with the highest it has previously seen. If the new token is higher, the recipient grants support. If not, it rejects. If the claimant collects enough support (a quorum), it holds authority. There is no explicit surrender step. Authority shifts because the rest of the system stops recognizing the old token (“higher number supersedes” trick). The old holder discovers it lost authority when its next request is rejected. Ties are impossible by construction. (See Appendix A for a step-by-step derivation.)

5c. Competitive, with split vote. Same as 5b but tokens are not globally unique (the “local value + global identity” trick is not used). Two processes may pick the same token. Each recipient votes for at most one claimant per token value (first come, first served). If neither claimant collects a quorum, the round fails: a split vote. Both claimants wait a random duration and retry with a higher token. The randomness makes repeated collisions unlikely. Eventually one goes first and wins. (Raft uses this variant. Paxos avoids it by using 5b.)
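The split-vote rule in 5c can be sketched as follows. The names (`Voter`, `request_vote`, `run_election`) are illustrative, not Raft’s actual RPC definitions.

```python
# Sketch of variant 5c: each voter supports at most one claimant per token
# value, first come first served. Two claimants with the same token can
# split the vote so that neither reaches a quorum.

class Voter:
    def __init__(self):
        self.voted_for = {}  # token value -> claimant already supported

    def request_vote(self, token, claimant):
        # The first claimant to arrive with this token gets the vote;
        # later claimants carrying the same token are refused.
        if token not in self.voted_for:
            self.voted_for[token] = claimant
        return self.voted_for[token] == claimant

def run_election(voters, token, claimant):
    """True if `claimant` collects a majority of votes for `token`."""
    votes = sum(v.request_vote(token, claimant) for v in voters)
    return votes > len(voters) // 2
```

With interleaved requests, two claimants can each capture half the voters and both fail. The retry, after a random backoff, uses a higher token and starts fresh, because votes are tracked per token value.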

| Protocol | Variant | What transfers | From | To | Via |
|---|---|---|---|---|---|
| MESI | 5a | Write permission | Current holder | Requester | Directory |
| Raft | 5c | Leadership | Old leader | New leader | Election (highest term wins) |
| Suzuki-Kasami | 5a | Token | Holder | Requester | Token message via coordinator |
| Paxos | 5b | Proposer role | Old proposer | New proposer | Higher ballot supersedes |
| ZooKeeper | 5a | Ephemeral lock | Crashed holder | Next in queue | Session expiry + watch |

6. Flush-Before-Handoff

Before surrendering authority, the holder must make its private state visible. Without this, the private state is lost when authority transfers. In MESI: if the holder surrenders write permission without flushing, the main store is stale. Shared copies can later be silently evicted. If all copies are evicted while the main store is still stale, the value is gone permanently. The flush prevents this by making the main store current before anyone else gets a copy.

6a. Flush to shared store. The holder writes to a common location before surrendering. Any future reader can fetch from the shared store. (Uses two tricks: “flush enables free eviction” of copies later, and “messages establish causality for free” since the flush creates a happens-before edge between the writer and all future readers.)

6b. Flush to successor directly. The holder sends its state directly to the next authority, bypassing the shared store. (Uses “messages establish causality for free”: the direct transfer carries both data and a happens-before edge.)

| Protocol | Variant | What is flushed | Where | Why |
|---|---|---|---|---|
| MESI | 6a | Dirty cache line | Main store (RAM) | Shared copies can be silently evicted; if main store is stale and all copies are evicted, the value is lost |
| MOESI | 6b | Dirty cache line | Requesting cache directly | Avoids expensive write to main store |
| Database | 6a | Write-ahead log | Disk | Writes must be recoverable after coordinator crash |
| Raft | 6b | Uncommitted log entries | Follower logs | A new leader must have all committed entries |

7. Silent Withdrawal

A process leaves a group without notifying anyone. The group’s tracking metadata becomes imprecise.

The alternative is explicit notification: send a message on every departure. The tradeoff: explicit notification keeps metadata accurate but costs a message every time a process leaves. Silent withdrawal saves that message at the cost of occasional imprecision, handled when discovered. In systems where departures are frequent and cheap (MESI: cache eviction happens constantly), the savings dominate.

7a. Discovered on next interaction. The protocol encounters the stale metadata and handles it gracefully. (Uses “lazy metadata” trick: the metadata is allowed to be imprecise, and staleness is resolved on discovery rather than prevented.)

7b. Discovered by timeout. Absence is detected by time passing without a heartbeat or renewal.
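Variant 7b is the core of a heartbeat failure detector, sketched below. `HeartbeatDetector` is an illustrative name, and the current time is passed in explicitly (rather than read from a system clock) so the logic stays deterministic.

```python
# Sketch of variant 7b (discovery by timeout): a process is presumed gone
# if no heartbeat has arrived within `timeout` time units.

class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of last heartbeat

    def heartbeat(self, pid, now):
        self.last_seen[pid] = now

    def alive(self, pid, now):
        """A process never heard from, or silent too long, is presumed gone."""
        last = self.last_seen.get(pid)
        return last is not None and now - last <= self.timeout
```

Note the asymmetry with 7a: nothing here ever receives a departure message; absence is inferred purely from elapsed time.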

| Protocol | Variant | Who leaves | What they held | Consequence |
|---|---|---|---|---|
| MESI | 7a | Controller evicting a shared entry | Read copy | Directory sends invalidation, controller acks immediately (nothing to discard) |
| TCP | 7b | Crashed peer | Connection state | Timeout detects absence |
| Lease systems | 7b | Expired lease holder | Lease | Lease expires, coordinator reclaims |
| Gossip protocols | 7b | Crashed node | Membership | Failure detector eventually notices |

Smart Tricks

The primitives alone do not explain why protocols work well. What makes protocols elegant is a set of recurring tricks that reduce messages, avoid coordination, or get guarantees for free. These tricks feel clever the first time you encounter them. They appear across unrelated protocols because the same constraints produce the same solutions.

Higher number supersedes. No need to ask the old holder to surrender. Just show a higher number. Every receiver enforces this independently. The old holder is not contacted, not asked, not even aware it lost authority. Authority shifts because the rest of the system stops recognizing the old number. Used in: Paxos ballots, Raft terms, fencing tokens. This is what makes competitive authority transfer (5b) possible without a coordinator.
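The trick can be sketched in the shape of the fencing-token example: the storage service rejects any write carrying a token lower than the highest it has seen, so a paused old lock holder is fenced off without ever being contacted. `FencedStore` is an illustrative name, not part of any real lock service API.

```python
# Sketch of "higher number supersedes": the receiver enforces the rule
# locally; the old authority is never told it lost.

class FencedStore:
    def __init__(self):
        self.highest_token = -1
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale authority: reject, no notification sent
        self.highest_token = token
        self.value = value
        return True
```

The old holder learns it lost authority only when its next write returns a rejection, which is exactly the lazy discovery described in primitive 4.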

Quorum intersection. Any two majorities share at least one member. So if a value was accepted by a majority, any future majority must include someone who knows about it. No process needs to talk to every other process. A majority is enough because any two majorities overlap. Used in: Paxos (both phases), Raft (election and replication), quorum reads/writes in distributed databases.
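For a small cluster the intersection property can be checked exhaustively rather than taken on faith; the function names below are illustrative.

```python
# Exhaustive check of quorum intersection: every pair of majorities over
# the same n-member cluster shares at least one member.
from itertools import combinations

def majorities(n):
    """All subsets of an n-member cluster of majority size or larger."""
    q = n // 2 + 1
    return [set(c) for size in range(q, n + 1)
            for c in combinations(range(n), size)]

def all_pairs_intersect(n):
    qs = majorities(n)
    return all(a & b for a in qs for b in qs)
```

The general argument is just counting: two subsets of size at least n//2 + 1 together contain more than n elements, so they must overlap.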

Lazy metadata. Don’t keep metadata perfectly accurate. Let it go stale. Handle the staleness gracefully when discovered. In MESI: the directory’s reader set lists a controller that already evicted its copy. When the directory sends an invalidation, the controller has nothing to discard and acknowledges immediately. The protocol works. The cost of perfect metadata (a message on every eviction) is higher than the cost of occasional wasted messages. Used in: MESI silent eviction, gossip protocol membership, lease-based systems where the coordinator discovers expiry lazily.

Flush enables free eviction. Silent withdrawal (7) is safe only because flush-before-handoff (6) happened earlier. The main store was made current, so any copy can vanish without loss. The expensive operation (flush) happens once, enabling the cheap operation (silent discard) to happen many times. One payment, many benefits. Used in: MESI (flush to main store enables silent eviction of shared copies), database checkpointing (flush enables log truncation).

Piggyback state on existing messages. Instead of sending separate tracking messages, attach your current knowledge to messages you are already sending. No extra round trips; the information travels for free. Lamport clocks: every message carries the sender’s logical timestamp, so the receiver inherits causal history without requesting it. Paxos: promises carry the highest accepted value, so the proposer learns the system’s state from the same messages that grant it permission. Used in: Lamport clocks, vector clocks, Paxos promise responses, MESI snoop responses that carry the cache line data along with the state change.

Messages establish causality for free. A protocol message that transfers data simultaneously establishes a Lamport happens-before edge between sender and receiver. The flush in MESI is not just “make the main store current.” It also establishes that everything the writer did before the flush causally precedes everything any future reader of that value does. The protocol does not need a separate mechanism for causal ordering; the message exchanges that enforce the coherence constraint provide it as a side effect. This is why Release/Acquire atomics work: Release is a flush, Acquire is a read from the flushed location, and the happens-before edge comes for free from the message the hardware already sent.

Local value + global identity = globally unique without coordination. A local value (counter, timestamp, round number) is unique within one process but can collide across processes. Append the process identity. Now it is globally unique. But uniqueness alone is not enough: every participant must be able to compare any two tokens and get the same answer. This requires a total ordering rule.

Two implementations. Lexicographic comparison: the token is a tuple (round, process_id). Compare round first; higher round wins. If rounds are equal, compare process ID; higher ID wins. Every participant applies the same two-step comparison and reaches the same conclusion. This is what Paxos uses. Modular arithmetic: flatten the tuple into a single number. Process i out of N processes only uses numbers where number mod N == i. Process 0 uses 0, 3, 6, 9… Process 1 uses 1, 4, 7, 10… The process identity is baked into the number. Comparison is plain integer comparison. Simpler but wastes number space.

Either way, any two tokens from different processes are always different and every participant computes the same ordering without coordination.
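Both encodings fit in a few lines. This is a sketch of the two schemes as described above, with illustrative function names; real Paxos implementations encode ballots in protocol-specific ways.

```python
# Two encodings of "local value + global identity".

def lex_token(round_num, pid):
    """Lexicographic encoding: Python tuples already compare elementwise,
    so (round, process_id) gives the two-step comparison from the text."""
    return (round_num, pid)

def next_modular_token(highest_seen, pid, n):
    """Modular encoding: process `pid` of `n` only uses integers congruent
    to pid mod n. Return the smallest such integer above `highest_seen`."""
    t = highest_seen + 1
    t += (pid - t) % n  # round up to the next number with t % n == pid
    return t
```

In the modular scheme, comparison is plain integer comparison and identity is recoverable as `token % n`, at the cost of skipping most of the number space.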

Lamport 1978 used this to turn a partial order into a total order: logical timestamps can tie, so append the process ID and break ties deterministically. Paxos ballots use the same structure. Twitter’s Snowflake IDs apply it at scale: (timestamp, machine_id, sequence), producing globally unique sortable IDs across thousands of machines with zero coordination. Database sharded keys, UUIDs, and distributed sequence generators all use the same trick.

Used in: Lamport total order (1978), Paxos ballot numbers, Raft (does NOT use this trick, allowing ties which it handles with random backoff and retry), Snowflake IDs, sharded database keys.

Scope the barrier. Revoke-and-confirm (3a) waits for N acknowledgments. But N is not all processes in the system. It is only the processes in the reader set for that specific entry. The directory scopes the barrier to exactly the processes that matter. Every other process in the system is unaffected. This is why MESI scales: a write to an uncontended entry involves zero messages, because the barrier has zero participants.


Conservation of Difficulty

A distributed protocol enforces a constraint on distributed state. The difficulty of enforcing it cannot be eliminated. It can only be moved.

Every primitive has variants. The variants are not ranked: none is strictly better than another. They differ in where the difficulty lives.

| Where difficulty goes | What it costs | Who pays |
|---|---|---|
| A coordinator | Single point of failure, bottleneck | System reliability |
| Message count | More rounds, more bandwidth | Network and latency |
| Time | Wasted rounds, retries | Progress speed |
| Design complexity | Clever token schemes, quorum arithmetic | Protocol designer |
| Assumptions | Simpler protocol, but only correct if assumptions hold | Whoever guarantees the assumptions |

Assumptions are the most powerful way to move difficulty. MESI uses all the simplest variants because the hardware guarantees no crashes (silicon doesn’t stop mid-protocol), no message loss (the bus/interconnect is reliable), and bounded latency (electrical signals have predictable timing). Those guarantees aren’t free. They are paid for by the hardware engineers who designed the silicon. The difficulty moved to them. The protocol is simple because someone else absorbed the hard parts.

Paxos uses harder variants because it assumes crashes happen, messages can be arbitrarily delayed, and only a majority survives. It cannot rely on anyone else’s guarantees. The difficulty stays in the protocol itself. DS theory calls these choices “system models” (crash-stop, crash-recovery, Byzantine, synchronous, asynchronous, partially synchronous). Each model is a set of assumptions. Choosing a stronger model (fewer failures possible) simplifies the protocol. Choosing a weaker model (more failures possible) forces the protocol to handle more, using harder variants of the same primitives.

Authority transfer illustrates this clearly:

| Variant | What you pay | Where difficulty lives | Progress |
|---|---|---|---|
| 5a Coordinated | Single point of failure, bottleneck | Coordinator reliability | Deterministic |
| 5b Unique tokens | Token design, more messages | Token generation + quorum logic | Deterministic |
| 5c Split vote | Wasted rounds on ties | Retry + random backoff | Probabilistic |

No variant has zero cost. If one were strictly better, the others would not exist. They exist because different systems have different answers to “where can I afford to put the difficulty?”

MESI uses all the simplest variants (all “a” variants: single granter, unanimity, coordinated transfer, flush to shared store) because the hardware can afford a coordinator. The directory is built into silicon. It does not crash. It does not partition. The difficulty is placed in a coordinator that the hardware guarantees is reliable.

Paxos uses harder variants (quorum grant, competitive transfer, quorum collect) because its coordinator CAN crash. It cannot afford to place the difficulty in a single reliable process, so it distributes the difficulty across quorums. The cost shows up as more messages and more design complexity.

The impossibility results in distributed systems (FLP, 1985) are the formal proof that the difficulty is conserved: no asynchronous protocol can guarantee agreement, termination, and fault tolerance simultaneously. One must be weakened. The choice of which one to weaken IS the choice of where to put the difficulty.


Relations

The seven primitives are not all independent. Some are compositions of others, some are sequential dependencies, and some are degenerate cases.

Atomic primitives. Request-grant (1) is the elementary message exchange: one request, one response. Monotonic ordering token (4) is not a message exchange but metadata attached to other primitives, giving receivers a way to reject stale requests. These two are the true atoms.

Shared structure. Broadcast-and-collect (2) and revoke-and-confirm (3) are the same message shape: send to N targets, wait for a threshold of responses. The difference is semantic. Revoke-and-confirm asks targets to give something up. Broadcast-and-collect asks targets for information or agreement. Both decompose into: N parallel request-grants (1) plus a barrier that blocks until enough responses arrive.

Sequential dependency. Flush-before-handoff (6) is a precondition on authority transfer (5). When the current holder has dirty state, the flush must complete before the transfer can proceed. The ordering is strict: 6 then 5, never reversed.

Degenerate case. Silent withdrawal (7) is what happens when a process skips its role in revoke-and-confirm (3). Instead of waiting for a revocation message and sending a confirmation, the process simply disappears. The protocol must handle discovering the missing participant later, typically by treating a revocation to an absent process as an immediate acknowledgment.


Composition

Every protocol composes these primitives.

MESI cache coherence uses: request-grant (1a), broadcast-and-collect (2a), revoke-and-confirm (3a), authority transfer (5a), flush-before-handoff (6a), silent withdrawal (7a). All simplest variants.

Paxos uses: request-grant (1b), broadcast-and-collect (2b), authority transfer (5b), monotonic ordering token (4b).

Two-phase commit uses: broadcast-and-collect (2a), revoke-and-confirm (3a).

Raft uses: request-grant (1b), broadcast-and-collect (2b), authority transfer (5c), flush-before-handoff (6b), monotonic ordering token (4b).

Distributed file locking (NFS v4) uses: request-grant (1a), authority transfer (5a), revoke-and-confirm (3a), silent withdrawal (7b).

The protocols look different because they use different names for the same primitives, solve different high-level problems, and operate under different failure models. The message-exchange patterns underneath are the same seven.


Note

Implementation-level distributed systems patterns have been cataloged, notably by Unmesh Joshi in Patterns of Distributed Systems (Addison-Wesley, 2023), which describes patterns like Write-Ahead Log, Generation Clock, Leader and Followers, and Consistent Core. These describe what a system component does.

This catalog operates one level below: the message-exchange shapes that appear inside those solutions. Joshi’s “Generation Clock” uses what we call a monotonic ordering token (primitive 4). His “Leader and Followers” composes authority transfer (5) with request-grant (1). The implementation patterns are built from these primitives, but the primitives themselves have not been separately cataloged.

The distributed systems literature organizes knowledge by problem (consensus, broadcast, mutual exclusion) and by system model (synchronous, asynchronous, crash-fault, Byzantine). This catalog addresses the layer between: the recurring message-exchange patterns that compose to build every protocol.


Appendix A: Deriving Competitive Authority Transfer

How would you design authority transfer without a coordinator? Walk through it step by step.

Step 1: The intuition. If everyone agrees on who has the highest number, that process is the authority. Simple. Numbers are totally ordered. Comparison is deterministic. Every participant independently reaches the same conclusion about who wins. No coordination needed for the decision itself.

Step 2: Problem: two processes pick the same number. Two solutions exist. Make numbers globally unique by combining the round number with the process identity (“local value + global identity” trick), so ties are structurally impossible (Paxos: (round, process_id), every pair of tokens has a deterministic winner). Or allow ties and handle them with retry (Raft: each participant votes for at most one claimant per number, if nobody gets a quorum, both retry with a random delay).

Step 3: Problem: concurrent claims. Two processes broadcast claims at the same time, with different numbers. Each recipient sees both. The higher one wins everywhere, because comparison is local and deterministic. The lower claimant fails to collect a quorum because the recipients already committed to the higher number.

Step 4: Problem: the old authority doesn’t know it lost. It is not informed. It finds out lazily, when its next request is rejected because its token is stale. This is safe because every other participant already enforces the higher number. The old authority’s actions are rejected by the rest of the system even before it knows it lost. Safety does not depend on the old holder stopping promptly.

Every step is a problem and a trick that solves it. The core insight: comparison is local and deterministic, so consensus on who holds authority emerges from each participant independently applying the same rule.
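The derivation can be condensed into a runnable sketch. Names (`Acceptor`, `claim_authority`) are illustrative, not from any real Paxos implementation; tokens use the lexicographic (round, process_id) encoding so ties are structurally impossible.

```python
# Sketch of competitive authority transfer (5b): each acceptor enforces
# "higher token wins" locally; a claimant holds authority once a majority
# supports its token.

class Acceptor:
    def __init__(self):
        self.highest = (-1, -1)  # (round, process_id), lexicographic order

    def claim(self, token):
        if token > self.highest:
            self.highest = token
            return True
        return False  # stale claim: rejected, old holder not notified

def claim_authority(acceptors, round_num, pid):
    """True if a majority of acceptors support token (round_num, pid)."""
    token = (round_num, pid)
    support = sum(a.claim(token) for a in acceptors)
    return support > len(acceptors) // 2
```

A short run shows all four steps at once: process 0 wins a round, a concurrent lower claim fails everywhere, a retry with a higher token supersedes, and the deposed holder discovers lazily that its old token is now rejected.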

This post is licensed under CC BY 4.0 by the author.